
CONTENTS

96 年台大電機 .............................................................................................................. 3
95 年台大電機 .............................................................................................................. 8
94 年台大電機 ............................................................................................................ 13
93 年台大電機 ............................................................................................................ 18
92 年台大電機 ............................................................................................................ 22
96 年台大資工 ............................................................................................................ 24
95 年台大資工 ............................................................................................................ 28
94 年台大資工 ............................................................................................................ 34
93 年台大資工 ............................................................................................................ 38
92 年台大資工 ............................................................................................................ 42
96 年清大電機 ............................................................................................................ 44
95 年清大電機 ............................................................................................................ 49
94 年清大電機 ............................................................................................................ 54
93 年清大電機 ............................................................................................................ 58
92 年清大電機 ............................................................................................................ 63
96 年清大通訊 ............................................................................................................ 67
95 年清大通訊 ............................................................................................................ 69
94 年清大通訊 ............................................................................................................ 71
93 年清大通訊 ............................................................................................................ 73
92 年清大通訊 ............................................................................................................ 75
96 年清大資工 ............................................................................................................ 77
95 年清大資工 ............................................................................................................ 80
94 年清大資工 ............................................................................................................ 85
93 年清大資工 ............................................................................................................ 89
92 年清大資工 ............................................................................................................ 92
96 年交大資聯 ............................................................................................................ 96
95 年交大資工 .......................................................................................................... 104
94 年交大資工 .......................................................................................................... 113
93 年交大資工 .......................................................................................................... 121
92 年交大資工 .......................................................................................................... 126
96 年交大電子所 ...................................................................................................... 131
95 年交大電子所 ...................................................................................................... 133
94 年交大電子所 ...................................................................................................... 137
96 年成大資工 .......................................................................................................... 143
95 年成大資工 .......................................................................................................... 145
94 年成大資工 .......................................................................................................... 149
93 年成大資工 .......................................................................................................... 152
92 年成大資工 .......................................................................................................... 154
96 年成大電機 .......................................................................................................... 157
95 年成大電機 .......................................................................................................... 163
94 年成大電機 .......................................................................................................... 168
96 年中央電機 .......................................................................................................... 174
95 年中央電機 .......................................................................................................... 181
96 年中央資工 .......................................................................................................... 186
95 年中央資工 .......................................................................................................... 189
94 年中央資工 .......................................................................................................... 194
93 年中央資工 .......................................................................................................... 197
92 年中央資工 .......................................................................................................... 199
96 年中正資工 .......................................................................................................... 202
95 年中正資工 .......................................................................................................... 205
94 年中正資工 .......................................................................................................... 209
93 年中正資工 .......................................................................................................... 213
92 年中正資工 .......................................................................................................... 217
96 年中山電機 .......................................................................................................... 222
96 年中山資工 .......................................................................................................... 227
95 年中山資工 .......................................................................................................... 236
94 年中山資工 .......................................................................................................... 243
93 年中山資工 .......................................................................................................... 250
92 年中山資工 .......................................................................................................... 257
96 年政大資科 .......................................................................................................... 263
95 年政大資科 .......................................................................................................... 267
94 年政大資科 .......................................................................................................... 269
93 年政大資科 .......................................................................................................... 272
92 年政大資科 .......................................................................................................... 274
96 年暨南資工 .......................................................................................................... 277
95 年暨南資工 .......................................................................................................... 280
94 年暨南資工 .......................................................................................................... 283
93 年暨南資工 .......................................................................................................... 286
92 年暨南資工 .......................................................................................................... 289
96 年台師大資工 ...................................................................................................... 291
95 年台師大資工 ...................................................................................................... 293
94 年台師大資工 ...................................................................................................... 295
93 年台師大資工 ...................................................................................................... 298
92 年台師大資工 ...................................................................................................... 300
93 年彰師大資工 ...................................................................................................... 302
95 年東華資工 .......................................................................................................... 305

96 年台科大電子 ...................................................................................................... 309
95 年海洋資工 .......................................................................................................... 314
95 年元智資工 .......................................................................................................... 317
95 年中原資工 .......................................................................................................... 320

96 年台大電機

1. _____ implements the translation of a program's address space to physical
addresses.
(A) DRAM
(B) Main memory
(C) Physical memory
(D) Virtual memory

Answer: (D)

2. To track whether a page of disk has been written since it was read into the
memory, a ____ is added to the page table.
(A) valid bit
(B) tag index
(C) dirty bit
(D) reference bit

Answer: (C)

3. (Refer to the CPU architecture of Figure 1 below) Which of the following
statements is correct for a load word (LW) instruction?
(A) MemtoReg should be set to 0 so that the correct ALU output can be sent to
the register file.
(B) MemtoReg should be set to 1 so that the Data Memory output can be sent to
the register file.
(C) We do not care about the setting of MemtoReg. It can be either 0 or 1.
(D) MemWrite should be set to 1.

Answer: (B)

[Figure 1: the single-cycle MIPS datapath with control — PC, instruction memory,
register file, ALU, data memory, and sign-extend unit, with control signals RegDst,
Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite, and PCSrc.]
Figure 1

4. The IEEE 754 binary representation of a 32-bit floating-point number is shown
below (normalized single-precision representation with bias = 127):
Bit position   31      30 ~ 23    22 ~ 0
Field          S       exponent   fraction
Width          1 bit   8 bits     23 bits
               (S)     (E)        (F)
What is the correct binary representation of (−0.75)10 in IEEE single-precision
float format?
(A) 1011 1111 0100 0000 0000 0000 0000 0000
(B) 1011 1111 1010 0000 0000 0000 0000 0000
(C) 1011 1111 1101 0000 0000 0000 0000 0000
(D) 1011 1110 1000 0000 0000 0000 0000 0000

Answer: (A)
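The encoding can be double-checked with a short script (a sketch using Python's struct module, not part of the original solution): −0.75 = −1.1two × 2^(−1), so S = 1, E = −1 + 127 = 126, and F = 100…0.

```python
import struct

# Pack -0.75 as a big-endian IEEE 754 single-precision float and
# print its 32-bit pattern.
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]
print(f"{bits:032b}")  # 10111111010000000000000000000000, i.e. choice (A)
```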

5. According to Question 4, what is the decimal number represented by the word
below?
Bit position 31 30 ~ 23 22 ~ 0
Binary value 1 10000011 011000……….00
(A) -10
(B) -11
(C) -22
(D) -44

Answer: (C)
Note: the exponent is 131 − 127 = 4 and the significand is 1.011two = 1.375, so the
value is −1.375 × 2^4 = −22.
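The decoding can be verified the same way (a quick sketch with Python's struct, not part of the original exam):

```python
import struct

# Reassemble the word from its fields: S=1, E=0b10000011 (131),
# F=0b0110...0, then reinterpret the bits as a float.
word = (1 << 31) | (0b10000011 << 23) | (0b011 << 20)
value = struct.unpack(">f", struct.pack(">I", word))[0]
print(hex(word), value)  # 0xc1b00000 -22.0
```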

6. Assume that the following assembly code is run on a machine with a 2 GHz clock.
The number of cycles per assembly instruction is shown in Table 1.
add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
addi $a1, $a1, -1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero

Instruction          Cycles
add, addi, sub       1
lw, beq, j           2
Table 1
Assume $a0 = 3 and $a1 = 20 initially, and select the correct value of $v0 at the
final cycle:
(A) 157
(B) 160
(C) 163
(D) 166

Answer: (B)

7. According to Question 6, please calculate the MIPS (million instructions per
second) rating of this assembly code:
(A) 1342
(B) 1344
(C) 1346
(D) 1348

Answer: (B)

Note: MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6).
With instruction count = 84 and total clock cycles = 125,
MIPS = (84 × 2 × 10^9) / (125 × 10^6) = 1344.
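The loop of Question 6 and the cycle accounting behind both answers can be checked with a small simulator (a sketch assuming the cycle costs of Table 1; the helper run is not part of the original):

```python
# Simulate the loop from Question 6, counting cycles per Table 1
# (add/addi/sub: 1 cycle; lw/beq/j: 2 cycles).
def run(a0=3, a1=20, clock_hz=2e9):
    cycles = instructions = 0

    def step(cost):
        nonlocal cycles, instructions
        cycles += cost
        instructions += 1

    t0 = 0; step(1)                 # add  $t0, $zero, $zero
    while True:
        step(2)                     # beq  $a1, $zero, finish
        if a1 == 0:
            break
        t0 += a0; step(1)           # add  $t0, $t0, $a0
        a1 -= 1;  step(1)           # addi $a1, $a1, -1
        step(2)                     # j    loop
    t0 += 100; step(1)              # addi $t0, $t0, 100
    v0 = t0;   step(1)              # add  $v0, $t0, $zero
    mips = instructions / (cycles / clock_hz) / 1e6
    return v0, cycles, instructions, mips

v0, cycles, instructions, mips = run()
print(v0, cycles, instructions, round(mips))  # 160 125 84 1344
```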

Questions 8-11. Link the following terms ((1) ~ (4))
(1) Microsoft Word
(2) Operating system
(3) Internet
(4) CD-ROM
to the most related terminology shown below (A, B, C, ..., K); choose the most
related one ONLY (answer format: e.g., (1) → K, for mapping item (1) to
terminology K).
A Applications software F Personal computer
B High-level programming language G Semiconductor
C Input device H Super computer
D Integrated circuit I Systems software
E Output device K Computer Networking
Please write down the answers in the answer table together with the choice
questions.
8. (1) Microsoft Word →
9. (2) Operating system →
10. (3) Internet →
11. (4) CD-ROM →

Answer:
8. (1) Microsoft Word → A
9. (2) Operating system → I
10. (3) Internet → K
11. (4) CD-ROM → C

Questions 12-15. Match the memory hierarchy element on the left with the closest
phrase on the right: (answer format: e.g., (1) → d, for mapping item (1) (left) to
item d (right))
(1). L1 cache a. A cache for a cache
(2). L2 cache b. A cache for disks
(3). Main memory c. A cache for a main memory
(4). TLB d. A cache for page table entries
Please write down the answers in the answer table together with the choice
questions.
12. (1) L1 cache →
13. (2) L2 cache →
14. (3) Main memory →
15. (4) TLB →

Answer:
12. (1) L1 cache → a
13. (2) L2 cache → c
14. (3) Main memory → b
15. (4) TLB → d

Questions 41-50. Based on the function of the seven control signals and the datapath
of the MIPS CPU in Figure 1 (the same figure for Question 28), complete the
settings of the control lines in Table 2 (use 0, 1, and X (don't care) only) for the
two MIPS CPU instructions (beq and add). X (don't care) can help to reduce the
implementation complexity; you should put X whenever possible.
Instr.  Branch  ALUSrc  RegWrite  RegDst  MemtoReg  MemWrite  MemRead  ALUOp1  ALUOp0
beq     (16)    (17)    (18)      (19)    (20)      (21)      (22)     0       1
add     (23)    (24)    (25)
Table 2
Please write down the answers in the answer table together with the choice
questions.
16. (16) =
17. (17) =
18. (18) =
19. (19) =
20. (20) =
21. (21) =
22. (22) =
23. (23) =
24. (24) =
25. (25) =

Answer:
16. (16) = 1
17. (17) = 0
18. (18) = 0
19. (19) = X
20. (20) = X
21. (21) = 0
22. (22) = 0
23. (23) = 1
24. (24) = 0
25. (25) = 0

95 年台大電機

Questions 1-4 are multiple-answer questions.
Choose ALL the correct answers for each of the following 1 to 4 questions. Note that
credit will be given only if all choices are correct.
1. With pipelines:
(A) Increasing the depth of pipelining increases the impact of hazards.
(B) Bypassing is a method to resolve a control hazard.
(C) If a branch is taken, the branch prediction buffer will be updated.
(D) In static multiple issue scheme, multiple instructions in each clock cycle are
fixed by the processor at the beginning of the program execution.
(E) Predication is an approach to guess the outcome of an instruction and to
remove the execution dependence.

Answer: (A)
Note: (B) is false (bypassing resolves data hazards, not control hazards).
Note: (C) is false (the prediction buffer is updated when the guess is wrong).
Note: (D) is false (the issue packet is fixed by the compiler, not the processor).
Note: (E) is false (this describes speculation, not predication).

2. Increasing the degree of associativity of a cache scheme will
(A) Increase the miss rate.
(B) Increase the hit time.
(C) Increase the number of comparators.
(D) Increase the number of tag bits.
(E) Increase the complexity of LRU implementation.

Answer: (B), (C), (D), (E)
Note: (A) is false (increasing associativity decreases the miss rate).

3. With caching:
(A) Write-through scheme improves the consistency between main memory and
cache.
(B) Split cache applies parallel caches to improve cache speed.
(C) TLB (translation-lookaside buffer) is a cache on page table, and could help
accessing the virtual addresses faster.
(D) No more than one TLB is allowed in a CPU to ensure consistency.
(E) A one-way set associative cache performs the same as a direct mapped
cache.

Answer: (A), (B), (E)
Note: (C) is false (a TLB helps access the physical address faster, not the virtual
address).

Note: (D) is false (the MIPS R3000 and the Pentium 4 both have two TLBs).

4. In a Pentium 4 PC,
(A) DMA mechanism can be applied to delegate responsibility from the CPU.
(B) AGP bus can be used to connect MCH (Memory Control Hub) and a
graphical output device.
(C) USB 2.0 is a synchronous bus using handshaking protocol.
(D) The CPU can fetch and translate IA-32 instructions.
(E) The CPU can reduce instruction latency with deep pipelining.

Answer: (A), (B), (D)
Note: (C) is false (USB 2.0 is an asynchronous bus).
Note: (E) is false (pipelining cannot reduce a single instruction's latency).

5. Examine the following two CPUs, each running the same instruction set. The first
one is a Gallium Arsenide (GaAs) CPU. A 10 cm (about 4") diameter GaAs wafer
costs $2000. The manufacturing process creates 4 defects per square cm. The
CPU fabricated in this technology is expected to have a clock rate of 1000 MHz,
with an average clock cycles per instruction of 2.5 if we assume an infinitely fast
memory system. The size of the GaAs CPU is 1.0 cm × 1.0 cm.
The second one is a CMOS CPU. A 20 cm (about 8") diameter CMOS wafer
costs $1000 and has 1 defect per square cm. The 1.0 cm × 2.0 cm CPU executes
multiple instructions per clock cycle to achieve an average clock cycles per
instruction of 0.75, assuming an infinitely fast memory, while achieving a clock
rate of 200 MHz. (The CPU is larger because it has on-chip caches and executes
multiple instructions per clock cycle.)
Assume α for both GaAs and CMOS is 2. Yields for GaAs and CMOS wafers
are 0.8 and 0.9 respectively. Most of this information is summarized in the
following table:
        Wafer       Wafer   Cost    Defects    Freq.           Die Area     Test Dies
        Diam. (cm)  Yield   ($)     (1/cm^2)   (MHz)    CPI    (cm × cm)    (per wafer)
GaAs    10          0.80    $2000   3.0        1000     2.5    1.0 × 1.0    4
CMOS    20          0.90    $1000   1.8        200      0.75   1.0 × 2.0    4
Hint: Here are two equations that may help:
dies per wafer = π × (wafer diameter/2)^2 / die area − π × wafer diameter / √(2 × die area) − test dies per wafer
die yield = wafer yield × (1 − defects per unit area × die area / α)^α
(a) Calculate the average execution time for each instruction with an infinitely
fast memory. Which CPU is faster and by what factor?
(b) How many seconds will each CPU take to execute a one-billion-instruction
program?

(c) What is the cost of a GaAs die for the CPU? Repeat the calculation for CMOS
die. Show your work.
(d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die?
(e) Based on the costs and performance ratios of the CPU calculated above, what
is the ratio of cost/performance of the CMOS CPU to the GaAs CPU?

Answer:
(a) Execution time (GaAs) for one instruction = 2.5 × 1 ns = 2.5 ns
Execution time (CMOS) for one instruction = 0.75 × 5 ns = 3.75 ns
The GaAs CPU is faster by 3.75/2.5 = 1.5 times
(b) Execution time (GaAs) = 1 × 10^9 × 2.5 ns = 2.5 seconds
Execution time (CMOS) = 1 × 10^9 × 3.75 ns = 3.75 seconds
  10 / 22   10
(c) GaAs: dies per wafer =   4 = 67
1 2 1
2
3 1 
die yield = 0.8  1   = 0.2
 2 
2000
Cost of a GaAs CPU die = = 149.25
67  0.2
  20 / 22   20
CMOS: dies per wafer =   4 = 121
2 2 2
2
1.8  2 
die yield = 0.9  1   = 0.576
 2 
1000
Cost of a GaAs CPU die = = 14.35
121  0.576
(d) The cost of a GaAs die is 149.25/14.35 = 10.4 times than a CMOS die
(e) The ratio of cost/performance of the CMOS to the GaAs is 10.4/1.5 = 6.93
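The hint's two formulas can be evaluated numerically (a sketch, using the defect densities from the table; note that the GaAs dies-per-wafer expression comes out to 52):

```python
import math

# dies per wafer = pi*(d/2)^2/A - pi*d/sqrt(2*A) - test dies
def dies_per_wafer(diameter, die_area, test_dies=4):
    return math.floor(math.pi * (diameter / 2) ** 2 / die_area
                      - math.pi * diameter / math.sqrt(2 * die_area)
                      - test_dies)

# die yield = wafer yield * (1 - defects * die area / alpha)^alpha
def die_yield(wafer_yield, defects, die_area, alpha=2):
    return wafer_yield * (1 - defects * die_area / alpha) ** alpha

gaas_dies, gaas_yield = dies_per_wafer(10, 1.0), die_yield(0.80, 3.0, 1.0)
cmos_dies, cmos_yield = dies_per_wafer(20, 2.0), die_yield(0.90, 1.8, 2.0)
print(gaas_dies, round(gaas_yield, 3), round(2000 / (gaas_dies * gaas_yield), 2))
print(cmos_dies, round(cmos_yield, 3), round(1000 / (cmos_dies * cmos_yield), 2))
```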

6. Given the following 8 possible solutions for a POP or a PUSH operation in a
STACK: (1) Read from Mem(SP), Decrement SP; (2) Read from Mem(SP),
Increment SP; (3) Decrement SP, Read from Mem(SP); (4) Increment SP, Read
from Mem(SP); (5) Write to Mem(SP), Decrement SP; (6) Write to Mem(SP),
Increment SP; (7) Decrement SP, Write to Mem(SP); (8) Increment SP, Write to
Mem(SP).
Choose only ONE of the above solutions for each of the following questions.
(a) Solution of a PUSH operation for a Last Full stack that grows ascending.
(b) Solution of a POP operation for a Next Empty stack that grows ascending.
(c) Solution of a PUSH operation for a Next Empty stack that grows ascending.
(d) Solution of a POP operation for a Last Full stack that grows ascending.

Answer:
(a) (8) (b) (3) (c) (6) (d) (1)

There are two conventions for using a stack pointer (SP) to mark the top of the
stack: the SP can point to the address of the item on top of the stack (Last Full),
or to the first empty location above the top item (Next Empty), as illustrated
below.
[Figure: two stacks growing toward higher addresses; the Last Full SP points at
the topmost occupied slot, while the Next Empty SP points one slot above it.]
Therefore, PUSH for Last Full is: (1) Increment SP; (2) Write to Mem(SP)
POP for Last Full is: (1) Read from Mem(SP); (2) Decrement SP
PUSH for Next Empty is: (1) Write to Mem(SP); (2) Increment SP
POP for Next Empty is: (1) Decrement SP; (2) Read from Mem(SP)

7. Execute the following Copy loop on a pipelined machine:
Copy: lw $10, 1000($20)
sw $10, 2000($20)
addiu $20, $20, -4
bne $20, $0, Copy
Assume that the machine datapath neither stalls nor forwards on hazards, so you
must add nop instructions.
(a) Rewrite the code inserting as few nop instructions as needed for proper
execution;
(b) Use multi-clock-cycle pipeline diagram to show the correctness of your
solution.

Answer: Suppose that a register read and write can occur in the same clock cycle,
and that the lw following bne executes in the branch delay slot.
(a) lw $10, 1000($20)
Copy: addiu $20, $20, -4
nop
sw $10, 2004($20)
bne $20, $0, Copy
lw $10, 1000($20)
(b)
        1   2   3   4   5   6   7   8   9   10
lw      IF  ID  EX  MEM WB
addiu       IF  ID  EX  MEM WB
nop             IF  ID  EX  MEM WB
sw                  IF  ID  EX  MEM WB
bne                     IF  ID  EX  MEM WB
lw                          IF  ID  EX  MEM WB

8. In a Personal Computer, the optical drive has a rotation speed of 7500 rpm, a
40,000,000 bytes/second transfer rate, and a 60 ms seek time. The drive is served
with a 16 MHz bus, which is 16 bits wide.
(a) How long does the drive take to read a random 100,000-byte sector?
(b) When transferring the 100,000-byte data, what is the bottleneck?

Answer:
(a) Seek time + rotational latency + transfer time
= 60 ms + 0.5 / (7500/60) s + 100000 / 40000000 s
= 60 ms + 4 ms + 2.5 ms = 66.5 ms
(b) The time for the bus to transfer 100000 bytes is (100000/2) / (16 × 10^6) s
= 3.125 ms, which is much less than the 66.5 ms the drive needs, so the
optical drive is the bottleneck.
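The arithmetic can be reproduced directly (a quick check in Python; variable names are illustrative):

```python
# Total read time = seek + rotational latency + transfer time.
seek = 60e-3                        # 60 ms seek
rotation = 0.5 / (7500 / 60)        # half a revolution at 7500 rpm
transfer = 100_000 / 40_000_000     # bytes / (bytes per second)
total = seek + rotation + transfer
# The 16-bit bus moves 2 bytes per cycle at 16 MHz.
bus = (100_000 / 2) / 16e6
print(round(total * 1e3, 3), round(bus * 1e3, 3))  # 66.5 3.125
```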

9. A processor has a 16 KB, 4-way set-associative data cache with 32-byte blocks.
(a) What is the number of sets in L1 cache?
(b) The memory is byte addressable and addresses are 35-bit long. Show the
breakdown of the address into its cache access components.
(c) How many total bytes are required for cache?
(d) Memory is connected via a 16-bit bus. It takes 100 clock cycles to send a
request to memory and to receive a cache block. The cache has 1-cycle hit
time and 95% hit rate. What is the average memory access time?
(e) A software program consists of 25% memory access instructions. What
is the average number of memory-stall cycles per instruction if we run this
program?

Answer:
(a) 16 KB / (32 × 4) = 128 sets
(b)
tag       index    Block offset   Byte offset
23 bits   7 bits   3 bits         2 bits
(c) 2^7 × 4 × (1 + 23 + 32 × 8) = 143360 bits = 140 Kbits = 17.5 KB
(d) Average Memory Access Time = 1 + 0.05 × 100 = 6 clock cycles
(e) (6 − 1) × 1.25 = 6.25 clock cycles
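The answer's arithmetic can be restated as a short script (a sketch; the variable names are illustrative):

```python
# 16 KB, 4-way set-associative cache with 32-byte blocks, 35-bit addresses.
cache_bytes, ways, block_bytes, addr_bits = 16 * 1024, 4, 32, 35
sets = cache_bytes // (block_bytes * ways)          # (a) 128 sets
index_bits = sets.bit_length() - 1                  # 7
offset_bits = block_bytes.bit_length() - 1          # 5 = 3 (block) + 2 (byte)
tag_bits = addr_bits - index_bits - offset_bits     # (b) 23-bit tag
total_bits = sets * ways * (1 + tag_bits + block_bytes * 8)   # (c) valid + tag + data
amat = 1 + 0.05 * 100                               # (d) hit time + miss rate * penalty
stalls = (amat - 1) * 1.25                          # (e) 1.25 accesses per instruction
print(sets, tag_bits, total_bits, amat, stalls)
```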

94 年台大電機

1. Compare two memory system designs for a classic 5-stage pipelined processor.
Both memory systems have a 4-KB instruction cache. But system A has a
4K-byte data cache, with a miss rate of 10% and a hit time of 1 cycle; and system
B has an 8K-byte data cache, with a miss rate of 5% and a hit time of 2 cycles
(the cache is not pipelined). For both data caches, cache lines hold a single word
(4 bytes), and the miss penalty is 10 cycles. What are the respective average
memory access times for data retrieved by load instructions for the above two
memory system designs, measured in clock cycles?

Answer:
Average memory access time for system A = 1 + 0.1 × 10 = 2 cycles
Average memory access time for system B = 2 + 0.05 × 10 = 2.5 cycles

2. (a) Describe at least one clear advantage a Harvard architecture (separate
instruction and data caches) has over a unified cache architecture (a single
cache memory array accessed by a processor to retrieve both instruction and
data).
(b) Describe one clear advantage a unified cache architecture has over the
Harvard architecture.

Answer:
(a) Cache bandwidth is higher for a Harvard architecture than for a unified
cache architecture.
(b) The hit ratio is higher for a unified cache architecture than for a Harvard
architecture.

3. (a) What is RAID?
(b) Match the RAID levels 1, 3, and 5 to the following phrases for the best match.
Use each only once.
Data and parity striped across multiple disks
Can withstand selective multiple disk failures
Requires only one disk for redundancy

Answer:
(a) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability
(b) RAID 5 Data and parity striped across multiple disks
RAID 1 Can withstand selective multiple disk failures
RAID 3 Requires only one disk for redundancy

4. (a) Explain the differences between a write-through policy and a write-back
policy.
(b) Tell which policy cannot be used in a virtual memory system, and describe
the reason.

Answer:
(a) Write-through: always write the data into both the cache and the memory.
Write-back: update values only in the block in the cache, then write the
modified block to the lower level of the hierarchy when the block is replaced.
(b) Write-through will not work for virtual memory, since writes take too long.
Instead, virtual memory systems use write-back.

5. (a) What is a denormalized number (denorm or subnormal)?
(b) Show how to use gradual underflow to represent a denorm in a floating point
number system.

Answer:
(a) For an IEEE 754 floating-point number, if the exponent is all 0s but the
fraction is non-zero, then the value is a denormalized number, which does not
have an assumed leading 1 before the binary point. Thus, it represents the
number (−1)^s × 0.f × 2^(−126), where s is the sign bit and f is the fraction.
(b) A denormalized number allows a number to degrade in significance until it
becomes 0, called gradual underflow.
For example, the smallest positive single-precision normalized number is
1.0000 0000 0000 0000 0000 0000two × 2^(−126)
but the smallest single-precision denormalized number is
0.0000 0000 0000 0000 0000 0001two × 2^(−126), or 1.0two × 2^(−149)
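The two extremes can be confirmed in code (a sketch with Python's struct; float32 values convert exactly to Python floats):

```python
import struct

# Smallest positive denormal: exponent 0, fraction 0...01  ->  2^-149.
smallest_denorm = struct.unpack(">f", (1).to_bytes(4, "big"))[0]
# Smallest positive normalized: exponent 1, fraction 0     ->  2^-126.
smallest_norm = struct.unpack(">f", (1 << 23).to_bytes(4, "big"))[0]
print(smallest_denorm == 2.0 ** -149, smallest_norm == 2.0 ** -126)  # True True
```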

6. Try to show the following structure in the memory map of a 64-bit Big-Endian
machine, by plotting the answer in a two-row map where each row contains 8
bytes.
Struct{
int a; // 0x11121314
char* c; // "A", "B", "C", "D", "E", "F", "G"
short e; // 0x2122
}s;

Answer:
Byte:   0    1    2    3    4    5    6    7
        11   12   13   14   'A'  'B'  'C'  'D'
        'E'  'F'  'G'  pad  21   22
Note: an int occupies 4 bytes, a short 2 bytes, and a char 1 byte. Even when the
structure is packed, the memory address of each data object must be aligned to a
multiple of a half word. Here three objects have to fit into two words, so the
structure must be packed, and each object is aligned on a half-word boundary.

7. Assume we have the following 3 ISA styles:
(1) Stack: All operations occur on top of stack where PUSH and POP are the only
instructions that access memory;
(2) Accumulator: All operations occur between an Accumulator and a memory
location;
(3) Load-Store: All operations occur in registers, and register-to-register
instructions use 3 registers per instruction.
(a) For each of the above ISAs, write an assembly code for the following
program segment using LOAD, STORE, PUSH, POP, ADD, and SUB and
other necessary assembly language mnemonics.
{ A = A + C;
D = A − B;
}
(b) Some operations are not commutative (e.g., subtraction). Discuss what are
the advantages and disadvantages of the above 3 ISAs when executing
non-commutative operations.

Answer:
(a)
(1) Stack (2) Accumulator (3) Load-Store
PUSH A LOAD A LOAD R1, A
PUSH C ADD C LOAD R2, C
ADD STORE A ADD R1, R1, R2
POP A SUB B STORE R1, A
PUSH A STORE D LOAD R2, B
PUSH B SUB R1, R1, R2
SUB STORE R1, D
POP D
(b) With the Stack and Accumulator ISAs, the order in which the operands of a
non-commutative operation are loaded cannot be changed freely, so there is
less flexibility when performing instruction scheduling at compile time. The
Load-Store ISA places no such restriction on the operand order of
non-commutative operations, so instruction scheduling is more flexible. On
the other hand, the Stack and Accumulator ISAs occupy less memory space
than the Load-Store ISA.

8. The program below divides two integers through repeated addition and was
originally written for a non-pipelined architecture. The divide function takes in as
its parameter a pointer to the base of an array of three elements where X is the
first element at 0($a0), Y is the second element at 4($a0), and the result Z is to be
stored into the third element at 8($a0). Line numbers have been added to the left
for use in answering the questions below.
1 DIVIDE: add $t3, $zero, $zero
2 add $t2, $zero, $zero
3 lw $t1, 4($a0)
4 lw $t0, 0($a0)
5 LOOP: beq $t2, $t0, END
6 addi $t3, $t3, 1
7 add $t2, $t2, $t1
8 j LOOP
9 END: sw $t3, 8($a0)
(a) Given a pipelined processor as discussed in the textbook, where will data be
forwarded (e.g., Line 10 EX/MEM → Line 11 EX)? Assume that
forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(b) How many data hazard stalls are needed? Between which instructions should
the stall bubble(s) be introduced (e.g., Line 10 and Line 11)? Again, assume
that forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(c) If X = 6 and Y = 3,
(i) How many times is the body of the loop executed?
(ii) How many times is the branch beq not taken?
(iii) How many times is the branch beq taken?
(d) Rewrite the code assuming delayed branches are used. If it helps, you may
assume that the answer to X/Y is at least 2. Assume that forwarding is used
whenever possible and that branches are resolved in IF/ID. Do not worry
about reducing the number of times through the loop, but arrange the code to
use as few cycles as possible by avoiding stalls and wasted instructions.

Answer:
(a) Line 4 MEM/WB → Line 5 EX
(b) 1 stall is needed, between Line 4 and Line 5
(c) (i) 2 (ii) 2 (iii) 1
(d) DIVIDE: add $t2, $zero, $zero
lw $t0, 0($a0)
add $t3, $zero, $zero
lw $t1, 4($a0)
LOOP: beq $t2, $t0, END
add $t2, $t2, $t1

j LOOP
addi $t3, $t3, 1
END: sw $t3, 8($a0)

93 年台大電機

1. Explain how each of the following six features contributes to the definition of a
RISC machine: (a) Single-cycle operation, (b) Load/Store design, (c) Hardwired
control, (d) Relatively few instructions and addressing modes, (e) Fixed
instruction format, (f) More compile-time effort.

Answer:
(a) Every operation is restricted to complete within one clock cycle, which
simplifies the hardware design and speeds up instruction execution.
(b) The Load/Store design restricts the CPU to using only registers as operands;
operands in memory are accessed through load/store instructions. Because
registers are faster than memory, this improves the efficiency of instruction
execution.
(c) Compared with a microprogrammed control unit, hardwired control has a
shorter instruction decode time.
(d) Relatively few instructions and addressing modes speed up instruction and
operand fetch.
(e) A fixed instruction format speeds up instruction decoding.
(f) RISC provides only basic instructions rather than more powerful complex
ones, so compilation produces longer code and therefore takes more
compile-time effort.
2. (1) Give an example of a structural hazard.
(2) Identify all of the data dependencies in the following code. Show which
dependencies are data hazards and how they can be resolved via
forwarding?
add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4

Answer:
(1) Suppose the following instructions are executed on a datapath with only a
single memory:
lw $5, 100($2)
add $2, $7, $4
add $4, $2, $5
sw $5, 100($2)
In clock cycle 4, instruction 1 reads its data from memory while instruction 4
is fetched from the same memory; that is, two instructions access the same
memory at the same time. Under these circumstances a structural hazard
occurs.
(2) Number the instructions as follows:
1 add $2, $5, $4
2 add $4, $2, $5
3 sw $5, 100($2)
4 add $3, $2, $4

        Data dependency         Data hazard
$2      (1, 2) (1, 3) (1, 4)    (1, 2) (1, 3)
$4      (2, 4)                  (2, 4)
Take the instruction pair (1, 2) for example. We don't need to wait for the first
instruction to complete before trying to resolve the data hazard. As soon as the
ALU creates the sum for the first instruction, we can supply it as an input for the
second instruction.

3. Explain (1) what is a precise interrupt? (2) what does RAID mean? (3) what does
TLB mean?

Answer:
(1) An interrupt or exception that is always associated with the correct instruction
in pipelined computers.
(2) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability.
(3) A cache that keeps track of recently used address mappings to avoid an access
to the page table.

4. Consider a 32-byte direct-mapped write-through cache with 8-byte blocks.


Assume the cache updates on write hits and ignores write misses. Complete the
table below for a sequence of memory references which occur from left to right.
(Redraw the table in your answer sheet)
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2
tag 0 0
hit/miss miss

Answer:
Suppose the address is 8 bits. A 32-byte direct-mapped cache with 8-byte blocks has 4 blocks, so: block offset = 3 bits, [2:0]; index = 2 bits, [4:3]; tag = 8 − 3 − 2 = 3 bits, [7:5].

address           tag           index
decimal  binary   binary  dec   binary  dec
00       000000   0       0     00      0
16       010000   0       0     10      2
48       110000   1       1     10      2
08       001000   0       0     01      1
56       111000   1       1     11      3
16       010000   0       0     10      2
08       001000   0       0     01      1
56       111000   1       1     11      3
32       100000   1       1     00      0
00       000000   0       0     00      0
60       111100   1       1     11      3

address     00    16    48    08    56    16    08    56    32    00    60
read/write  r     r     r     r     r     r     r     w     w     r     r
index       0     2     2     1     3     2     1     3     0     0     3
tag         0     0     1     0     1     0     0     1     1     0     1
hit/miss    miss  miss  miss  miss  miss  miss  hit   hit   miss  hit   hit
Note: the problem assumes the cache is updated only on write hits, while write misses are ignored. The write to address 32 (third from last) is a write miss and is therefore skipped, which is why the following read of address 00 (second from last) is a hit.
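The trace above can be verified mechanically. Here is a small, hypothetical Python sketch of the same cache (4 sets of one 8-byte block, direct-mapped, write misses ignored); the function and variable names are illustrative, not part of the exam:

```python
def simulate(refs):
    """Direct-mapped cache: 4 sets, one 8-byte block per set."""
    tags = [None] * 4                       # resident tag per set, initially empty
    results = []
    for addr, op in refs:
        index = (addr >> 3) & 0b11          # address bits [4:3]
        tag = addr >> 5                     # address bits above the index
        hit = tags[index] == tag
        if not hit and op == "r":           # read miss: fetch the block
            tags[index] = tag
        # a write hit updates the block in place; a write miss is ignored
        results.append((index, tag, "hit" if hit else "miss"))
    return results

refs = [(0, "r"), (16, "r"), (48, "r"), (8, "r"), (56, "r"), (16, "r"),
        (8, "r"), (56, "w"), (32, "w"), (0, "r"), (60, "r")]
results = simulate(refs)
print([r[2] for r in results])
```

Because the ignored write miss at address 32 never fills set 0, the final read of address 00 still hits, matching the last row of the table.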

5. (1) List two Branch Prediction strategies and (2) compare their differences.

Answer:
(1) Static prediction and dynamic prediction.
(2) Static prediction:
(a) Always predicts the branch as taken (or as not taken).
(b) Lower prediction accuracy and a larger misprediction penalty.
(c) Needs no extra hardware support.
Dynamic prediction:
(a) Predicts at run time, using run-time information.
(b) Higher prediction accuracy and a smaller misprediction penalty.
(c) Needs extra hardware support.

6. Explain how the reference bit in a page table entry is used to implement an
approximation to the LRU replacement strategy.

Answer:
The operating system periodically clears the reference bits and later records them
so it can determine which pages were touched during a particular time period.
With this usage information, the operating system can select a page that is among
the least recently referenced.

7. Trace Booth's algorithm step by step for the multiplication of 2 × (−6)

Answer:
2ten × (−6)ten = 0010two × 1010two = 1111 0100two = −12ten
Iteration  Step                        Multiplicand  Product
0          Initial values              0010          0000 1010 0
1          00 ⇒ no operation           0010          0000 1010 0
           Shift right product         0010          0000 0101 0
2          10 ⇒ prod = prod − Mcand    0010          1110 0101 0
           Shift right product         0010          1111 0010 1
3          01 ⇒ prod = prod + Mcand    0010          0001 0010 1
           Shift right product         0010          0000 1001 0
4          10 ⇒ prod = prod − Mcand    0010          1110 1001 0
           Shift right product         0010          1111 0100 1
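The trace can also be reproduced in software. The following is a small Python sketch of radix-2 Booth multiplication on 4-bit operands; the function name and bit manipulation are illustrative, not from the exam:

```python
def booth_multiply(mcand, mplier, bits=4):
    """Radix-2 Booth: product register holds the multiplier plus an appended 0 bit."""
    mask = (1 << bits) - 1
    product = (mplier & mask) << 1              # low half = multiplier, appended bit = 0
    for _ in range(bits):
        pair = product & 0b11                   # current bit and the appended bit
        if pair == 0b10:                        # 10: subtract multiplicand from upper half
            product += ((-mcand) & mask) << (bits + 1)
        elif pair == 0b01:                      # 01: add multiplicand to upper half
            product += (mcand & mask) << (bits + 1)
        product &= (1 << (2 * bits + 1)) - 1    # discard any carry out
        sign = product >> (2 * bits)            # arithmetic shift right by 1
        product = (product >> 1) | (sign << (2 * bits))
    result = product >> 1                       # drop the appended bit
    if result >= 1 << (2 * bits - 1):           # reinterpret as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(2, -6))   # -12
```

booth_multiply(2, -6) walks through exactly the four iterations tabulated above and returns −12.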

8. What are the differences between Trap and Interrupt?

Answer:
Interrupt: an asynchronous event raised by hardware (from outside the processor) that keeps the CPU from proceeding directly to the next instruction. An interrupt can occur at any time and is unrelated to the program the processor is executing; it is usually generated by I/O devices or a timer.
Trap: a synchronous event (from inside the processor), such as the execution of a system call, that keeps the CPU from proceeding directly to the next instruction. A trap can be reproduced by re-running the program with the same data and machine state; an interrupt cannot.

92 年台大電機 (2003 NTU EE)

1. A certain machine with a 10 ns (10 × 10⁻⁹ s) clock period can perform jumps (1


cycle), branches (3 cycles), arithmetic instructions (2 cycles), multiply
instructions (5 cycles), and memory instructions (4 cycles). A certain program has
10% jumps, 10% branches, 50% arithmetic, 10% multiply, and 20% memory
instructions. Answer the following question. Show your derivation in sufficient
detail.
(1) What is the CPI of this program on this machine?
(2) If the program executes 10⁹ instructions, what is its execution time?
(3) A 5-cycle multiply-add instruction is implemented that combines an
arithmetic and a multiply instruction. 50% of the multiplies can be turned into
multiply-adds. What is the new CPI?
(4) Following (3) above, if the clock period remains the same, what is the
program‘s new execution time.

Answer:
(1) 1 × 0.1 + 3 × 0.1 + 2 × 0.5 + 5 × 0.1 + 4 × 0.2 = 2.7
(2) Execution time = 10⁹ × 2.7 × 10 ns = 27 s
(3) CPI = (1 × 0.1 + 3 × 0.1 + 2 × 0.45 + 5 × 0.05 + 4 × 0.2 + 5 × 0.05) / (0.1 + 0.1 + 0.45 + 0.05 + 0.2 + 0.05) = 2.6/0.95 ≈ 2.74
(4) Execution time = 10⁹ × 0.95 × 2.74 × 10 ns = 26.03 s

Note: when the instruction frequencies no longer sum to 100%, the CPI computed from them must be divided by the sum of the frequencies.
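The renormalization in the note can be expressed directly. A short Python sketch (the helper name is assumed for illustration):

```python
def cpi(mix):
    """mix: list of (frequency, cycles). Divide by the frequency total,
    in case the frequencies no longer sum to 100%."""
    total = sum(f for f, _ in mix)
    return sum(f * c for f, c in mix) / total

old = [(0.1, 1), (0.1, 3), (0.5, 2), (0.1, 5), (0.2, 4)]
# half the multiplies become 5-cycle multiply-adds, absorbing an arithmetic op
new = [(0.1, 1), (0.1, 3), (0.45, 2), (0.05, 5), (0.2, 4), (0.05, 5)]
print(round(cpi(old), 2), round(cpi(new), 2))   # 2.7 2.74
```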

2. Answer True (O) or False (×) for each of the following. (NO penalty for wrong
answer.)
(1) Most computers use direct mapped page tables.
(2) Increasing the block size of a cache is likely to take advantage of temporal
locality.
(3) Increasing the page size tends to decrease the size of the page table.
(4) Virtual memory typically uses a write-back strategy, rather than a
write-through strategy.
(5) If the cycle time and the CPI both increase by 10% and the number of
instruction deceases by 20%, then the execution time will remain the same.
(6) A page fault occurs when the page table entry cannot be found in the
translation lookaside buffer.
(7) To store a given amount of data, direct mapped caches are typically smaller
than either set associative or fully associative caches, assuming that the
block size for each cache is the same.
(8) The two‘s complement of negative number is always a positive number in
the same number format.

(9) A RISC computer will typically require more instructions than a CISC
computer to implement a given program.
(10) Pentium 4 is based on the RISC architecture.

Answer:
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
× × O O × × O × O ×
Note: modern CPUs like the Athlon XP and Pentium 4 are based on a mixture of RISC and CISC.

3. The average memory access time (AMAT) is defined as


AMAT = hit time + miss_rate × miss_penalty
Answer the following two questions. Show your derivation in sufficient detail.
(1) Find the AMAT of a 100MHz machine, with a miss penalty of 20 cycles, a hit
time of 2 cycles, and a miss rate of 5%.
(2) Suppose doubling the size of the cache decrease the miss rate to 3%, but
cause the hit time to increase to 3 cycles and the miss penalty to increase to 21
cycles. What is the AMAT of the new machine?

Answer:
(1) AMAT = (2 + 0.05 × 20) × 10 ns = 30 ns
(2) AMAT = (3 + 0.03 × 21) × 10 ns = 36.3 ns
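Both cases follow directly from the AMAT formula; a quick Python check (cycle time 10 ns at 100 MHz):

```python
def amat_ns(hit_cycles, miss_rate, penalty_cycles, cycle_ns=10):
    # AMAT = hit time + miss_rate * miss_penalty, converted to nanoseconds
    return (hit_cycles + miss_rate * penalty_cycles) * cycle_ns

print(round(amat_ns(2, 0.05, 20), 1))   # 30.0
print(round(amat_ns(3, 0.03, 21), 1))   # 36.3
```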

4. If a pipelined processor has 5 stages and takes 100 ns to execute N instructions.


How long will it take to execute 2N instructions, assuming the clock rate is 500
MHz and no pipeline stalls occur?

Answer:
Clock cycle time = 1/(500 × 10⁶) = 2 ns; N + 4 = 100/2 = 50 ⇒ N = 46
The execution time of 2N instructions = 2 × 46 + 4 = 96 clock cycles = 192 ns
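The N + 4 bookkeeping generalizes to N + (stages − 1) cycles for an unstalled pipeline; a small sketch:

```python
def pipeline_cycles(n, stages=5):
    # The first instruction takes `stages` cycles; each additional
    # instruction completes one cycle later when there are no stalls.
    return n + stages - 1

CYCLE_NS = 2                               # 500 MHz clock
print(pipeline_cycles(46) * CYCLE_NS)      # 100 -> N = 46 matches the given 100 ns
print(pipeline_cycles(92) * CYCLE_NS)      # 192 -> time for 2N instructions
```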

96 年台大資工 (2007 NTU CSIE)

1. Answer the following questions briefly:


(a) Typically one CISC instruction, since it is more complex, takes more time to
complete than a RISC instruction. Assume that an application needs N CISC
instructions and 2N RISC instructions, and that one CISC instruction takes an
average 5T ns to complete, and one RISC instruction takes 2T ns. Which
processor has the better performance?
(b) Which of the following processors have a CISC instruction set architecture?
ARM AMD Opteron
Alpha 21164 IBM PowerPC
Intel 80x86 MIPS
Sun UltraSPARC
(c) True & False questions;
(1) There are four types of data hazards; RAR, RAW, WAR, and WAW.
(True or False?)
(2) AMD and Intel recently added 64-bit capability to their processors
because most programs run much faster with 64-bit instructions. (True or
False?)
(3) With a modern processor capable of dynamic instruction scheduling and
out-of-order execution, it is better that the compiler does not optimize
the instruction sequences. (True or False?)

Answer:
(a) CISC time = N × 5T = 5NT ns
RISC time = 2N × 2T = 4NT ns
RISC time < CISC time, so the RISC architecture has better performance.
(b) Intel 80x86, AMD Opteron
(c) (1) False, RAR does not cause data hazard.
(2) False, most programs run much faster with 64-bit processors not 64-bit
instructions
(3) False, the compiler still tries to help improve the issue rate by placing the
instructions in a beneficial order.

2. For commercial applications, it is important to keep data on-line and safe in


multiple places.
(a) Suppose we want to backup 100GB of data over the network. How many
hours does it take to send the data by FTP over the Internet? Assume the
average bandwidth between the two places is 1Mbits/sec.

(b) Would it be better if you burn the data onto DVDs and mail the DVDs to the
other site? Suppose it takes 10 minutes to bum a DVD which has 4GB
capacity and the fast delivery service can deliver in 12 hours.

Answer:
(a) (100 GB × 1024 MB/GB × 8 bits/byte) / (1 Mbit/s) = 819,200 seconds ≈ 227.56 hours
(b) (100 GB / 4 GB) × 10 minutes = 250 minutes ≈ 4.17 hours
4.17 + 12 = 16.17 hours < 227.56 hours
So it is better to burn the data onto DVDs and mail them to the other site.
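The comparison above, in a few lines of Python (1 GB taken as 1024 MB, as in the answer):

```python
ftp_hours = (100 * 1024 * 8) / 1 / 3600     # 819,200 Mbit over a 1 Mbit/s link
dvd_hours = (100 / 4) * 10 / 60 + 12        # 25 DVDs at 10 min each + 12 h delivery
print(round(ftp_hours, 2), round(dvd_hours, 2))   # 227.56 16.17
```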

3. Suppose we have an application running on a shared-memory multiprocessor.


With one processor, the application runs for 30 minutes.
(a) Suppose the processor clock rate is 2GHz. The average CPI (assuming that all
references hit in the cache) on single processor is 0.5. How many instructions
are executed in the application?
(b) Suppose we want to reduce the run time of the application to 5 minutes with 8
processors. Let's optimistically assume that parallelization adds zero overhead
to the application, i.e. no extra instructions, no extra cache misses, no
communications, etc. What fraction of the application must be executed in
parallel?
(c) Suppose 100% of our application can be executed in parallel. Let's now
consider the communication overhead. Assume the multiprocessor has a 200
ns time to handle reference to a remote memory and processors are stalled on
a remote request. For this application, assume 0.02% of the instructions
involve a remote communication reference, no matter how many processors
are used. How many processors are needed at least to make the run time be
less than 5 minutes?
(d) Following the above question, but let's assume the remote communication
references in the application increases as the number of processors increases.
With N processors, there are 0.02*(N−1)% instructions involve a remote
communication reference. How many processors will deliver the maximum
speedup?

Answer:
(a) 30 × 60 s = Instruction count × 0.5 × 0.5 ns
Instruction count = 1800 s / 0.25 ns = 7200 × 10⁹
(b) Let F be the fraction of the application that must be executed in parallel.
Then 5 = 30 × ((1 − F) + F/8) ⇒ F = 20/21 ≈ 0.952
(c) Let N be the number of processors that will make the run time < 5 minutes.
(30 × 60)/N + 7200 × 10⁹ × 0.0002 × 200 ns < 5 × 60 ⇒ N > 150
So at least 150 processors are needed to make the run time less than 5 minutes.
30  60
(d) Speedup = + 7200  109  0.0002  (N – 1)  200 ns
N
–1
= 1800N + 288 (N – 1)
Let the derivative of Speedup = 0  –1800N–2 + 288 = 0  N = 2.5
2.5 processors ill deliver the maximum speedup
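Part (d)'s model can also be scanned numerically; a sketch (all times in seconds):

```python
def run_time(n):
    remote_refs = 7200e9 * 0.0002 * (n - 1)   # references that go remote
    return 1800 / n + remote_refs * 200e-9    # compute time + 200 ns per reference

best = min(range(1, 9), key=run_time)
print(best)   # 3, the best integer near the continuous optimum N = 2.5
```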

4. Number representation.
(a) What range of integer number can be represented by 16-bit 2's complement
number?
(b) Perform the following 8-bit 2's complement number operation and check
whether arithmetic overflow occurs. Check your answer by converting to
decimal sign-and-magnitude representation.
11010011
− 11101100

Answer:
(a) −2¹⁵ to +(2¹⁵ − 1), i.e., −32,768 to +32,767
(b) 11010011 − 11101100 = 11010011 + 00010100 = 11100111
Check: −45 − (−20) = −45 + 20 = −25
The range of an 8-bit 2's complement number is −2⁷ to +(2⁷ − 1), so no overflow occurs.
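The subtraction can be confirmed with 8-bit wraparound arithmetic; a quick Python check:

```python
def to_signed8(x):
    x &= 0xFF
    return x - 256 if x >= 128 else x

a, b = 0b11010011, 0b11101100     # -45 and -20 in 8-bit two's complement
diff = (a - b) & 0xFF             # subtraction modulo 2^8
print(format(diff, "08b"), to_signed8(diff))   # 11100111 -25
```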

5. Bus
(a) Draw a graph to show the memory hierarchy of a system that consists of CPU,
Cache, Memory and I/O devices. Mark where memory bus and I/O bus is.
(b) Assuming system 1 have a synchronous 32-bit bus with clock rate = 33 Mhz
running at 2.5V. System 2 has a 64-bit bus with clock rate = 66 Mhz running
at 1.8V. Assuming the average capacitance on each bus line is 2pF for bus in
system 1. What is the maximum average capacitance allowed for the bus of
system 2 so the peak power dissipation of system 2 bus will not exceed that of
the system 1 bus?
(c) Serial bus protocol such as SATA has gained popularity in recent years. To
design a serial bus that supports the same peak throughput as the bus in
system 2, what is the clock frequency of this serial bus?

Answer:
(a) (figure: the CPU and its cache connect over the memory bus to main memory; a bus adapter bridges the memory bus to the I/O bus, where the I/O devices attach)
(b) Power dissipation = f C V²
The peak power dissipation for system 1 = 33 × 10⁶ × (2 × 10⁻¹² × 32) × 2.5² = 13.2 mW
Let C be the total capacitance of the system 2 bus:
66 × 10⁶ × C × 1.8² < 13.2 mW ⇒ C < 61.73 pF
The maximum average capacitance for system 2 is 61.73 pF.
(c) Since SATA uses a single signal path to transmit data serially (bit by bit), the clock frequency should be designed as 66 MHz × 64 = 4.224 GHz to support the same peak throughput as the system 2 bus.
Note (b): the problem does not ask for the capacitance of a single bus line in system 2.
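Part (b) numerically, using the P = f·C·V² model (C here is the total capacitance across all bus lines):

```python
def bus_power(freq_hz, cap_per_line, lines, volts):
    # peak dynamic power of a bus: frequency * total capacitance * voltage^2
    return freq_hz * cap_per_line * lines * volts ** 2

p1 = bus_power(33e6, 2e-12, 32, 2.5)      # system 1
c2_max = p1 / (66e6 * 1.8 ** 2)           # solve f * C * V^2 <= p1 for C
print(round(p1 * 1e3, 1), round(c2_max * 1e12, 2))   # 13.2 61.73
```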

95 年台大資工 (2006 NTU CSIE)

PART I:
Please answer the following questions in the format listed below. If you do not follow
the format, you will get zero points for these questions.
1. (1) T or F
(2) T or F
(3) T or F
(4) T or F
(5) T or F
2. X = Y=
Stall cycles =
3. Option is times faster than the old machine
4. 1-bit predictor: 2-bit predictor:

1. True & False Questions


(1) If an address translation for a virtual page is present in the TLB, then that
virtual page must be mapped to a physical memory page.
(2) The set index decreases in size as cache associativity is increased (assume
cache size and block size remain the same)
(3) It is impossible to have a TLB hit and a data cache miss for the same data
reference.
(4) An instruction takes less time to execute on a pipelined processor than on a
nonpipelined processor (all other aspects of the processors being the same).
(5) A multi-cycle implementation of the MIPS processor requires that a single
memory be used for both instructions and data.

Answer:
(1) T
(2) T
(3) F
(4) F
(5) T

2. Consider the following program:


int A[100]; /* size(int) = 1 word */
for (i = 0; i < 100; i++)
A[i] = A[i] + 1;
The code for this program on a MIPS-like load/store architecture looks as
follows:
ADDI R1, R0, #X
ADDI R2, R0, A ; A is the base address of array A
LOOP: LD R3, 0(R2)
ADDI R3, R3, #1
SD R3, 0(R2)
ADDI R2, R2, #Y
SUBI R1, R1, #1
BNE R1, R0, LOOP
Consider a standard 5-stage MIPS pipeline. Assume that the branch is resolved
during the instruction decode stage, and full bypassing/register forwarding are
implemented. Assume that all memory references hit in the cache and TLBs. The
pipeline does not implement any branch prediction mechanism. What are values
of #X and #Y, and how many stall cycles are in one loop iteration including stalls
caused by the branch instruction?

Answer:
X = 100
Y = 4
Stall cycles = 3: (1) between LD and ADDI, (2) between SUBI and BNE, (3) one after BNE for the taken branch.
Note: since the branch decision is resolved during the ID stage, a stall cycle is needed between SUBI and BNE.

3. Suppose you had a computer hat, on average, exhibited the following properties
on the programs that you run:
Instruction miss rate: 2%
Data miss rate: 4%
Percentage of memory instructions: 30%
Miss penalty: 100 cycles
There is no penalty for a cache hit (i.e. the cache can supply the data as fast as the
processor can consume it.) You want to update the computer, and your budget will
allow one of the following:
• Option #1: Get a new processor that is twice as fast as your current
computer. The new processor's cache is twice as fast too, so it
can keep up with the processor.
• Option #2: Get a new memory that is twice as fast.
Which is a better choice? And what is the speedup of the chosen design compared
to the old machine?

Answer:
Option 2 is 4.2/2.6 ≈ 1.62 times faster than the old machine.
Suppose that the base CPI = 1:
CPI_old = 1 + 0.02 × 100 + 0.04 × 0.3 × 100 = 4.2
CPI_opt1 = 0.5 + 0.02 × 100 + 0.04 × 0.3 × 100 = 3.7
CPI_opt2 = 1 + 0.02 × 50 + 0.04 × 0.3 × 50 = 2.6

Note (option #1): a processor that is twice as fast with a cache that keeps up means that, ignoring stalls, each instruction's execution time is halved. Instruction time = CPI × cycle time. The problem does not say the clock rate doubles, so the cycle time is unchanged, and the CPI must be halved to reflect the halved instruction time. Since the base CPI is assumed to be 1 unless stated otherwise, option #1's base CPI becomes 0.5.
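The stall accounting for the three machines, as a sketch (base CPI assumed 1, per the note):

```python
def cpi(base, i_miss, d_miss, mem_frac, penalty):
    # base CPI plus instruction-miss and data-miss stall cycles
    return base + i_miss * penalty + mem_frac * d_miss * penalty

old  = cpi(1.0, 0.02, 0.04, 0.3, 100)
opt1 = cpi(0.5, 0.02, 0.04, 0.3, 100)   # CPU (and cache) twice as fast
opt2 = cpi(1.0, 0.02, 0.04, 0.3, 50)    # faster memory halves the penalty
print(round(old, 1), round(opt1, 1), round(opt2, 1))   # 4.2 3.7 2.6
print(round(old / opt2, 2))                            # 1.62
```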

4. The following series of branch outcomes occurs for a single branch in a program.
(T means the branch is taken, N means the branch is not taken).
TTTNNTTT
How many instances of this branch instruction are mis-predicted with a 1-bit and
2-bit local branch predictor, respectively? Assume that the BHT are initialized to
the N state. You may assume that this is the only branch is the program.

Answer:
1-bit predictor: 3    2-bit predictor: 5
Note: with the FSM of the 4th edition of the textbook, the 2-bit predictor mispredicts 5 times in total; with the 3rd-edition FSM, it mispredicts 6 times.
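Both predictors can be simulated with a saturating counter (the 4th-edition 2-bit FSM is a 2-bit saturating counter, and a 1-bit predictor is the 1-bit case). A sketch, initialized to strongly not-taken as the problem states:

```python
def mispredictions(outcomes, bits):
    counter, maximum, errors = 0, (1 << bits) - 1, 0
    for taken in outcomes:
        predicted = counter >= (maximum + 1) // 2   # predict taken in upper half
        errors += predicted != taken
        counter = min(counter + 1, maximum) if taken else max(counter - 1, 0)
    return errors

seq = [True, True, True, False, False, True, True, True]   # T T T N N T T T
print(mispredictions(seq, 1), mispredictions(seq, 2))      # 3 5
```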

PART II:
For the following questions in Part II, please make sure that you summarize all your
answers in the format listed below. The answers are short, such as alphabets, numbers,
or yes/no. You do not have to show your calculations. There is no partial credit for
incorrect answers.

(5a) (5b)
(6a) (6b) (6c)
(7a) (7b) (7c)
(8a) (8b) (8c) (8d) (8e)
(9a) (9b) (9c) (9d) (9e)

5. Consider the following performance measurements for a program:


Measurement Computer A Computer B Computer C
Instruction Count 12 billion 12 billion 10 billion
Clock Rate 4 GHz 3 GHz 2.8 GHz
Cycles Per Instruction 2 1.5 1.4
(5a) Which computer is faster?
(5b) Which computer has the higher MIPS rating?

Answer:
(5a) Computer C
Execution time for Computer A = (12 × 10⁹ × 2) / (4 × 10⁹) = 6 seconds
Execution time for Computer B = (12 × 10⁹ × 1.5) / (3 × 10⁹) = 6 seconds
Execution time for Computer C = (10 × 10⁹ × 1.4) / (2.8 × 10⁹) = 5 seconds
(5b) The MIPS ratings for all three computers are the same.
MIPS for Computer A = (4 × 10⁹) / (2 × 10⁶) = 2000
MIPS for Computer B = (3 × 10⁹) / (1.5 × 10⁶) = 2000
MIPS for Computer C = (2.8 × 10⁹) / (1.4 × 10⁶) = 2000
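The two metrics side by side, as a Python sketch:

```python
# per machine: (instruction count, clock rate in Hz, CPI)
machines = {"A": (12e9, 4e9, 2.0), "B": (12e9, 3e9, 1.5), "C": (10e9, 2.8e9, 1.4)}
times = {n: ic * cpi / clk for n, (ic, clk, cpi) in machines.items()}     # seconds
mips  = {n: clk / (cpi * 1e6) for n, (ic, clk, cpi) in machines.items()}  # MIPS
print({n: round(t, 2) for n, t in times.items()})
print({n: round(m) for n, m in mips.items()})
```

C finishes first even though all three machines carry the same MIPS rating, which is why MIPS alone is a poor comparison metric.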

6. Consider the following two components in a computer system:


 A CPU that sustain 2 billion instructions per second.
 A memory backplane bus capable of sustaining a transfer rate of 1000
MB/sec
If the workload consists of 64 KB reads from the disk, and each read operation
takes 200,000 user instructions and 100,000 OS instructions.
(6a) Calculate the maximum I/O rate of CPU.
(6b) Calculate the maximum I/O rate of memory bus.
(6c) Which of the two components is likely to be the bottleneck for I/O?

Answer:
(6a) 6667
(6b) 15625
(6c) CPU
The maximum I/O rate of the CPU = (2 × 10⁹) / (100,000 + 200,000) ≈ 6667 reads/sec
The maximum I/O rate of the memory bus = (1000 × 10⁶) / (64 × 10³) = 15,625 reads/sec

7. You are going to enhance a computer, and there are two possible improvements:
either make multiply instructions run four times faster than before, or make
memory access instructions run two times faster than before. You repeatedly run
a program that takes 100 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
Calculate the speedup:
(7a) Speedup if we improve only multiplication:
(7b) Speedup if we only improve memory access:
(7c) Speedup if both improvements are made:

Answer:

(7a) Speedup = 1 / (0.2/4 + 0.8) ≈ 1.18
(7b) Speedup = 1 / (0.5/2 + 0.5) ≈ 1.33
(7c) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) ≈ 1.67
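The three speedups follow Amdahl's law; a small helper (name assumed for illustration):

```python
def speedup(parts):
    """parts: list of (fraction of time, improvement factor);
    the remaining fraction runs unchanged."""
    enhanced = sum(f / s for f, s in parts)
    rest = 1 - sum(f for f, _ in parts)
    return 1 / (enhanced + rest)

print(round(speedup([(0.2, 4)]), 2))              # 1.18
print(round(speedup([(0.5, 2)]), 2))              # 1.33
print(round(speedup([(0.2, 4), (0.5, 2)]), 2))    # 1.67
```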

8. Multiprocessor designs have become popular for today‘s desktop and mobile
computing. Given a 2-way symmetric multiprocessor (SMP) system where both
processors use write-back caches, write update cache coherency, and a block size
of one 32-bit word. Let us examine the cache coherence traffic with the following
sequence of activities involving shared data. Assume that all the words already
exist in both caches and are clean. Fill-in the last column (8a)-(8e) in the table to
identify the coherence transactions that should occur on the bus for the sequence.
Step   Processor     Memory activity   Memory address   Transaction required (Yes or No)
1 Processor 1 1-word write 100 (8a)
2 Processor 2 1-word write 104 (8b)
3 Processor 1 1-word read 100 (8c)
4 Processor 2 1-word read 104 (8d)
5 Processor 1 1-word read 104 (8e)

Answer:
(8a) Yes
(8b) Yes
(8c) No
(8d) No
(8e) No

9. False sharing can lead to unnecessary bus traffic and delays. Follow the direction
of Question 8, except change the cache coherency policy to write-invalidate and
block size to four words (128-bit). Reveal the coherence transactions on the bus
by filling-in the last column (9a)-(9e) in the table below.
Step   Processor     Memory activity   Memory address   Transaction required (Yes or No)
1 Processor 1 1-word write 100 (9a)
2 Processor 2 1-word write 104 (9b)
3 Processor 1 1-word read 100 (9c)
4 Processor 2 1-word read 104 (9d)
5 Processor 1 1-word read 104 (9e)

Answer:
(9a) Yes
(9b) Yes
(9c) Yes
(9d) No
(9e) No

Note: these answers follow the snoopy protocol of the 4th edition of the textbook; under the 3rd edition, the answer to (9d) would be Yes.

94 年台大資工 (2005 NTU CSIE)

1. Suppose we have a 32 bit MIPS-like RISC processor with the following


arithmetic and logical instructions (along with their descriptions):
Addition
add rd, rs, rt : Put the sum of registers rs and rt into register rd.
Addition immediate
add rt, rs, imm : Put the sum of register rs and the sign-extended immediate into register rt.
Subtract
sub rd, rs, rt : Register rt is subtracted from register rs and the result is put in register rd.
AND
and rd, rs, rt : Put the logical AND of registers rs and rt into register rd.
AND immediate
and rt, rs, imm : Put the logical AND of register rs and the zero-extended immediate into register rt.
Shift left logical
sll rd, rt, imm : Shift the value in register rt left by the distance (i.e. the number of bits) indicated by the immediate (imm) and put the result in register rd. The vacated bits are filled with zeros.
Shift right logical
srl rd, rt, imm : Shift the value in register rt right by the distance (i.e. the number of bits) indicated by the immediate (imm) and put the result in register rd. The vacated bits are filled with zeros.
Please use at most one instruction to generate assembly code for each of the
following C statements (assuming variable a and b are unsigned integers). You
can use the variable names as the register names in your assembly code.
(a) b = a / 8; /* division operation */
(b) b = a % 16; /* modulus operation */

Answer:
(a) srl b, a, 3
(b) and b, a, 15

Note: dividing a binary number by a power of two requires no actual division; just split the dividend at the appropriate bit position: the left part is the quotient and the right part is the remainder. For example, if a = 10010011, dividing a by 16 gives quotient 1001 and remainder 0011; dividing a by 8 gives quotient 10010 and remainder 011.
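The note's split-the-bits trick is exactly what the srl/and pair implements; a quick check in Python:

```python
a = 0b10010011            # 147
print(a >> 3, a // 8)     # shift right by 3 == divide by 8
print(a & 15, a % 16)     # mask the low 4 bits == modulo 16
```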

2. Assume a RISC processor has a five-stage pipeline (as shown below) with each
stage taking one clock cycle to finish. The pipeline will stall when encountering
data hazards.
IF ID EXE MEM WB
IF: Instruction fetch
ID: Instruction decode and register file read
EXE: Execution or address calculation
MEM: Data memory access
WB: Write back to register file
(a) Suppose we have an add instruction followed immediately by a subtract
instruction that uses the add instruction's result:
add r1 ← r2, r3
sub r5 ← r1, r4
If there is no forwarding in the pipeline, how many cycle(s) will the pipeline
stall for?
(b) If we want to use forwarding (or bypassing) to avoid the pipeline stall caused
by the code sequence above, choosing from the denoted 6 points (A to F) in
the following simplified data path of the pipeline, where (from which point to
which point) should the forwarding path be connected?

IF ID EXE MEM WB
A B C D E F

(c) Suppose the first instruction of the above code sequence is a load of r1 instead
of an add (as shown below).
load r1 ← [r2]
sub r5 ← r1, r4
Assuming we have a forwarding path from point E to point C in the pipeline
data path, will there be any pipeline stall for this code sequence? If so, how
many cycle(s)? (If your first answer is yes, you have to answer the second
question correctly to get the 5 pts credit.)

Answer:
(a) If the register file can be written and read in the same clock cycle, the pipeline stalls for only 2 cycles; otherwise it stalls for 3 clock cycles.
(b) D to C
(c) Yes, 1 clock cycle

3. Cache misses are classified into three categories-compulsory, capacity, and


conflict. What types of misses could be reduced if the cache block size is
increased?

Answer: compulsory

4. Consider three types of methods for transferring data between an I/O device and
memory: polling, interrupt driven, and DMA. Rank the three techniques in terms
of lowest impact on processor utilization

Answer: (1) DMA, (2) Interrupt driven, (3) Polling

5. Assume an instruction set that contains 5 types of instructions: load, store,


R-format, branch and jump. Execution of these instructions can be broken into 5
steps: instruction fetch, register read, ALU operations, data access, and register
write. Table 1 lists the latency of each step assuming perfect caches.
Instruction   Instruction   Register   ALU         Data     Register
class         fetch         read       operation   access   write
Load          2ns           1ns        1ns         2ns      1ns
Store         2ns           1ns        1ns         2ns
R-format      2ns           1ns        1ns                  1ns
Branch        2ns           1ns        1ns
Jump          2ns
Table 1
(a) What is the CPU cycle time assuming a multicycle CPU implementation (i.e.,
each step in Table 1 takes one cycle)?
(b) Assuming the instruction mix shown below, what is the average CPI of the
multicycle processor without pipelining? Assume that the I-cache and
D-cache miss rates are 3% and 10%, and the cache miss penalty is 12 CPU
cycles
Instruction Type Frequency
Load 40%
Store 30%
R-format 15%
Branch 10%
Jump 5%
(c) To reduce the cache miss rate, the architecture team is considering increasing
the data cache size. They find that by doubling the data cache size, they can
eliminate half of data cache misses. However, the data access stage now takes
4 ns. Do you suggest them to double the data cache size? Explain your
answer.

Answer:
(a) The CPU cycle time is set by the longest step, so the multicycle implementation's cycle time is 2 ns.
(b) CPI without considering cache misses = 5 × 0.4 + 4 × 0.3 + 4 × 0.15 + 3 × 0.1 + 1 × 0.05 = 4.15
Average CPI = 4.15 + 0.03 × 12 + (0.3 + 0.4) × 0.1 × 12 = 5.35
(c) With the data cache doubled, the cycle time grows to 4 ns, so the 24 ns miss penalty (12 cycles × 2 ns) becomes 6 CPU cycles:
CPI after doubling the data cache = 4.15 + 0.03 × 6 + (0.3 + 0.4) × 0.05 × 6 = 4.54
Average instruction execution time before doubling the data cache = 5.35 × 2 ns = 10.7 ns
Average instruction execution time after doubling the data cache = 4.54 × 4 ns = 18.16 ns
The average instruction execution time is longer after doubling the data cache, so doubling it is not recommended.
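The trade-off in (c) in code: the miss penalty is fixed at 24 ns of memory time, so it is 12 cycles at a 2 ns clock but only 6 cycles at 4 ns. A sketch:

```python
def avg_time_ns(base_cpi, i_mr, d_mr, mem_frac, penalty_cycles, cycle_ns):
    cpi = base_cpi + i_mr * penalty_cycles + mem_frac * d_mr * penalty_cycles
    return cpi * cycle_ns

before = avg_time_ns(4.15, 0.03, 0.10, 0.7, 12, 2)
after  = avg_time_ns(4.15, 0.03, 0.05, 0.7, 6, 4)   # half the data misses remain
print(round(before, 1), round(after, 2))            # 10.7 18.16
```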

93 年台大資工 (2004 NTU CSIE)

1. Consider a system with an average memory access time of 50 nanoseconds, a


three level page table (meta-directory, directory, and page table). For full credit,
your answer must be a single number and not a formula.
(a) If the system had an average page fault rate of 0.01% for any page accessed
(data or page table related), and an average page fault took 1 millisecond to
service, what is the effective memory access time (assume no TLB or memory
cache)?
(b) Now assume the system has no page faults, we are considering adding a TLB
that will take 1 nanosecond to lookup an address translation. What hit rate in
the TLB is required to reduce the effective access time to memory by a factor
of 2.5?

Answer:
(a) Ignoring page faults, the effective memory access time = 4 × 50 = 200 ns (one access each for the meta-directory, the directory, and the page table, plus one for the data). With a page-fault rate of 0.01% per access:
effective memory access time = 200 + 4 × 0.01% × 1,000,000 ns = 600 ns
(b) 200/2.5 = 80 ns = 1 ns + 50 ns + 150 ns × (1 − H) ⇒ H ≈ 0.81
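Part (b) solved numerically under the same model (1 ns TLB lookup, 50 ns data access, and 150 ns of extra page-table walk on a TLB miss):

```python
def effective_ns(hit_rate):
    # TLB lookup + data access + page-table walk on the miss fraction
    return 1 + 50 + 150 * (1 - hit_rate)

target = 200 / 2.5                      # reduce 200 ns by a factor of 2.5
h = 1 - (target - 51) / 150             # solve effective_ns(h) == target
print(round(h, 2), round(effective_ns(h), 1))   # 0.81 80.0
```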

2. In this problem set, show your answers in the following format:


<a> ? CPU cycles
Derive your answer.
<b> CPI = ?
Derive your answer.
<c> Machine ? is ?% faster than ?
Derive your answer.
<d> ? CPU cycles
Derive your answer.
Both machine A and B contain one-level on-chip caches. The CPU clock rates
and cache configurations for these two machines are shown in Table 1. The
respective instruction/data cache miss rates in executing program P are also
shown in Table 1. The frequency of load/store instructions in program P is 20%.
On a cache miss, the CPU stalls until the whole cache block is fetched from the
main memory. The memory and bus system have the following characteristics:
1. the bus and memory support 16-byte block transfer;
2. 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking
1 bus clock cycle, and 1 bus clock cycle required to send an address to
memory (assuming shared address and data lines);
3. assuming there is no cycle needed between each bus operation;
4. a memory access time for the first 4 words (16 bytes) is 250 ns, each
additional set of four words can be read in 25 ns. Assume that a bus transfer
of the most recently read data and a read of the next four words can be
overlapped.
                        Machine A                       Machine B
CPU clock rate          800 MHz                         400 MHz
I-cache configuration   Direct-mapped, 32-byte          2-way, 32-byte block,
                        block, 8K                       128K
D-cache configuration   2-way, 32-byte block, 16K       4-way, 32-byte block, 256K
I-cache miss rate       6%                              1%
D-cache miss rate       15%                             4%
Table 1
To answer the following questions, you don't need to consider the time required
for writing data to the main memory:
(1) What is the data cache miss penalty (in CPU cycles) for machine A?
(2) What is the average CPI (Cycle per Instruction) for machine A in executing
program P? The CPI (Cycle per Instruction) is 1 without cache misses.
(3) Which machine is faster in executing program P and by how much? The CPI
(Cycle per Instruction) is 1 without cache misses for both machine A and B.
(4) What is the data cache miss penalty (in CPU cycles) for machine A if the bus
and memory system support 32-byte block transfer? All the other memory/bus
parameters remain the same as defined above.

Answer:
(a) 440 CPU cycles
Since the bus clock rate is 200 MHz, one bus clock cycle is 5 ns.
The time to transfer one 32-byte block from memory to cache = 2 × (1 + 250/5 + 1 × 4) × 5 ns = 550 ns
The data miss penalty for machine A = 550 ns / (1/800 MHz) = 440 CPU cycles
(b) CPI = 40.6
Average CPI = 1 + 0.06 × 440 + 0.2 × 0.15 × 440 = 40.6
(c) Machine B is 409% faster than A
Machines A and B have the same 32-byte cache block size, so the miss penalty is 550 ns for both.
Since machine B's clock rate is 400 MHz, its miss penalty is 220 clock cycles.
Machine B's average CPI = 1 + 0.01 × 220 + 0.2 × 0.04 × 220 = 4.96
Execution time for machine A = 40.6 × 1.25 ns × IC = 50.75 IC
Execution time for machine B = 4.96 × 2.5 ns × IC = 12.4 IC
So machine B is 50.75/12.4 = 4.09 times faster than machine A.
(d) 240 CPU cycles
The time to transfer one 32-byte block from memory to cache = (1 + 250/5 + 25/5 + 4) × 5 ns = 300 ns
The data miss penalty for machine A = 300 ns / (1/800 MHz) = 240 CPU cycles
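The 16-byte-transfer case in (a) as a sketch: each 16-byte read costs 1 address cycle, 50 cycles of memory latency (250 ns), and 4 data-transfer cycles on the 200 MHz bus:

```python
BUS_NS = 5                                   # one 200 MHz bus cycle

def block_transfer_ns(block_bytes):
    per_16B = 1 + 250 // BUS_NS + 4          # 55 bus cycles per 16-byte read
    return (block_bytes // 16) * per_16B * BUS_NS

CPU_CYCLE_NS = 1.25                          # machine A runs at 800 MHz
penalty = block_transfer_ns(32) / CPU_CYCLE_NS
print(block_transfer_ns(32), round(penalty))   # 550 440
```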

3. Given the bit pattern 10010011, what does it represent assuming


(a) It‘s a two‘s complement integer?
(b) It‘s an unsigned integer?
Write down your answer in decimal format.

Answer:
(a) −2⁷ + 2⁴ + 2¹ + 2⁰ = −109
(b) 2⁷ + 2⁴ + 2¹ + 2⁰ = 147
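Both interpretations of the pattern, checked in Python:

```python
bits = 0b10010011
unsigned = bits                                  # plain binary value
signed = bits - 256 if bits & 0x80 else bits     # 8-bit two's complement
print(signed, unsigned)   # -109 147
```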

4. Draw the schematic for a 4-bit 2‘s complement adder/subtractor that produces A +
B if K=1, A − B if K = 0. In your design try to use minimum number of the
following basic logic gates (1-bit adders, AND, OR, INV, and XOR).

Answer:
K = 0: S = A + (B ⊕ 1) + 1 = A + (B' + 1) = A − B
K = 1: S = A + (B ⊕ 0) + 0 = A + B + 0 = A + B
Schematic: four 1-bit full adders in a ripple-carry chain, producing s3..s0 and carry-out c4. Each bit bi enters its adder through an XOR gate whose other input is K' (K passed through an inverter); K' also drives the carry-in c0 of the least-significant adder.

5. We want to add for 4-bit numbers, A[3:0], B[3:0], C[3:0], D[3:0] together using
carry-save addition. Draw the schematic using 1-bit full adders.

Answer:
(schematic: a first row of 1-bit full adders adds A, B, and C in carry-save form, producing sum and carry vectors; a second row adds D to them, and a final ripple-carry adder produces the result)

6. We have an 8-bit carry ripple adder that is too slow. We want to speed it up by
adding one pipeline stages. Draw the schematic of the resulting pipeline adder.
How many 1-bit pipeline register do you need? Assuming the delay of 1-bit adder
is 1 ns, what‘s the maximum clock frequency the resulting pipelined adder can
operate?

Answer:
(1) schematic
c0
a0
b0 + s0

a1
b1 + s1

a2
b2 + s2

a3
b3 + s3

a4
b4 + s4

a5
+ s5
b5

a6
+ s6
b6

a7
+ s7
b7
c8

(2) 13 1-bit pipeline registers


(3) 1/(4 ns) = 250 MHz

92 年台大資工 (2003 NTU CSIE)

1. A pipelined processor architecture consists of 5 pipeline stages: instruction fetch


(IF), instruction decode and register read (ID), execution or address calculation
(EX), data memory access (MEM), and register write back (WB). The delay of
each stage is summarized below: IF = 2 ns, ID = 1.5 ns, EX = 4 ns, MEM = 2.5 ns,
WB = 2 ns.
(1) What's the maximum attainable clock rate of this processor?
(2) What kind of instruction sequence will cause a data hazard that cannot be
resolved by forwarding? What's the performance penalty?
(3) To improve the clock rate of this processor, the architect decided to add
one pipeline stage. The locations of the existing pipeline registers cannot be
changed. Where should this pipeline stage be placed? What's the maximum
clock rate of the 6-stage processor? (Assume there is no delay penalty when
adding pipeline stages.)
(4) Repeat the analysis in (2) for the new 6-stage processor. Is there any other
type of instruction sequence that will cause a data hazard and cannot be
resolved by forwarding? Comparing the 5-stage and 6-stage designs, what
effect does adding one pipeline stage have on data hazard resolution?

Answer:
(1) The longest stage delay is 4 ns, so the clock rate = 1 / (4 × 10^-9) = 250 MHz.
(2) (a) When a load instruction is immediately followed by an instruction that
uses the load's destination register as a source operand, the data hazard
cannot be resolved by forwarding alone.
(b) The pipeline must stall one clock cycle and then forward, so the
performance penalty is a one-clock-cycle delay.
(3) (a) The EX stage has the longest delay, so it should be split into EX1 and
EX2 stages of 2 ns each.
(b) The longest stage delay is now 2.5 ns, so the clock rate
= 1 / (2.5 × 10^-9) = 400 MHz.
(4) (a) Now not only the load-use data hazard but also the ordinary data hazard
between dependent back-to-back ALU instructions cannot be resolved by
forwarding alone: a load-use hazard requires a 2-cycle stall, and an
ordinary data hazard requires a 1-cycle stall.
(b) Adding the pipeline stage increases the penalty of resolving data hazards.
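The clock-rate arithmetic in (1) and (3) can be replayed with a short sketch
(Python; the stage delays are the ones given in the problem):

```python
# Derive the clock rate of a pipelined processor from its stage delays (ns).
# The cycle time is set by the slowest stage.
def clock_rate_mhz(stage_delays_ns):
    return 1000.0 / max(stage_delays_ns)

five_stage = [2.0, 1.5, 4.0, 2.5, 2.0]        # IF, ID, EX, MEM, WB
print(clock_rate_mhz(five_stage))              # 250.0

# Splitting the 4 ns EX stage into two 2 ns stages leaves MEM (2.5 ns)
# as the slowest stage.
six_stage = [2.0, 1.5, 2.0, 2.0, 2.5, 2.0]
print(clock_rate_mhz(six_stage))               # 400.0
```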

2. (1) What type of cache misses (compulsory, conflict, capacity) can be reduced by
increasing the cache block size?
(2) Can increasing the degree of cache associativity always reduce the average
memory access time? Explain your answer.

Answer:

(1) Compulsory
(2) No. AMAT = hit time + miss rate × miss penalty. Increasing the degree of
cache associativity may decrease the miss rate but lengthens the hit time;
therefore, the average memory access time is not necessarily reduced.
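A short sketch makes the trade-off concrete; the hit times and miss rates below
are illustrative assumptions, not values from the problem:

```python
# AMAT = hit time + miss rate * miss penalty (all in cycles).
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

direct_mapped = amat(hit_time=1.0, miss_rate=0.050, miss_penalty=50)
# Doubling associativity lowers the miss rate a little but lengthens the
# hit time -- here the trade-off makes AMAT slightly worse, not better.
two_way = amat(hit_time=1.2, miss_rate=0.048, miss_penalty=50)
print(round(direct_mapped, 2), round(two_way, 2))   # 3.5 3.6
```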

3. List two types of cache write policies. Compare the pros and cons of these two
policies.

Answer:
(1) Write-through: a scheme in which writes always update both the cache and the
memory, ensuring that data is always consistent between the two.
Write-back: a scheme that handles writes by updating values only in the
block in the cache, then writing the modified block to the lower level of the
hierarchy when the block is replaced.
(2)
Policies   Write-through                       Write-back
Pros       Easy to implement                   A block is written back to memory
                                               only when it is replaced, so CPU
                                               writes are faster
Cons       Every write into the cache must     Harder to implement
           also go to memory, so CPU writes
           are slower

4. Briefly describe the difference between synchronous and asynchronous bus
transactions.

Answer:
Bus type       Synchronous Bus                    Asynchronous Bus
Differences    Includes a clock in the control    Not clocked
               lines and a fixed protocol for
               communication relative to the
               clock
Advantage      Requires very little logic and     Can accommodate a wide range of
               can run very fast                  devices; can be lengthened
                                                  without worrying about clock skew
Disadvantage   Every device on the bus must run   Requires a handshaking protocol
               at the same clock rate; to avoid
               clock skew, the bus cannot be
               long if it is fast

96 年清大電機

1. The following MIPS assembly program tries to copy words from the address in
register $a0 to the address in $a1, counting the number of words copied in
register $v0. The program stops copying when it finds a word equal to 0. You do
not have to preserve the contents of registers $v1, $a0, and $a1. This terminating
word should be copied but not counted.
loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # Increment count words copied
sw $v1, 0($a1) # Write to destination
addi @a0, $a0, 1 # Advance pointer to next word
addi @a0, $a1, 1 # Advance pointer to next word
bne $v1, $zero, loop # Loop if word copied != zero
There are multiple bugs in this MIPS program; fix them and turn in a bug-free
version.

Answer:
addi $v0, $zero, −1
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 4
addi $a1, $a1, 4
bne $v1, $zero, Loop
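As a sanity check, the corrected loop's semantics can be mirrored in a short
Python sketch (the copy_words helper is a stand-in for the memory-to-memory
loop, not part of the original answer):

```python
# Simulate the fixed loop: copy words until (and including) a zero word,
# counting only the non-zero words copied.
def copy_words(src):
    dst, count = [], -1              # $v0 is pre-set to -1 in the fix
    for word in src:                 # lw   $v1, 0($a0)
        count += 1                   # addi $v0, $v0, 1
        dst.append(word)             # sw   $v1, 0($a1)
        if word == 0:                # bne  $v1, $zero, Loop
            break
    return dst, count

print(copy_words([7, 3, 5, 0, 9]))   # ([7, 3, 5, 0], 3)
```

The terminating zero is copied but the count reports only the three non-zero
words, as the problem requires.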

2. Carry lookahead is often used to speed up the addition operation in ALU. For a
4-bit addition with carry lookahead, assuming the two 4-bit inputs are a3a2a1a0
and b3b2b1b0, and the carry-in is c0,
(a) First derive the recursive equations of carry-out ci+1 in terms of ai and bi and
ci, where i = 0, 1,.., 3.
(b) Then by defining the generate (gi) and propagate (pi) signals, express c1, c2,
c3, and c4 in terms of only the gi's, pi's, and c0.
(c) Estimate the speed up for this simple 4-bit carry lookahead adder over the
4-bit ripple carry adder (assuming each logic gate introduces T delay).

Answer:
(a) ci+1 = aibi + aici + bici
(b) c1 = g0 + (p0·c0)
c2 = g1 + (p1·g0) + (p1·p0·c0)
c3 = g2 + (p2·g1) + (p2·p1·g0) + (p2·p1·p0·c0)
c4 = g3 + (p3·g2) + (p3·p2·g1) + (p3·p2·p1·g0) + (p3·p2·p1·p0·c0)
(c) The critical path delay of the 4-bit ripple carry adder = 2T × 4 = 8T
The critical path delay of the 4-bit carry lookahead adder = 2T + T = 3T

Speedup = 8T/3T = 2.67
Note: in its adder-design sections, the Patterson & Hennessy textbook
consistently defines the critical path delay as the delay needed to generate all
the carries.
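The lookahead equations in (b) can be cross-checked against the recursive ripple
equation of (a); a sketch that tests all 4-bit inputs exhaustively:

```python
# Compute the lookahead carries c1..c4 from g_i = a_i & b_i and
# p_i = a_i | b_i, and cross-check them against a plain ripple-carry pass.
def lookahead_carries(a_bits, b_bits, c0):
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    c = [c0]
    for i in range(4):
        # c_{i+1} = g_i + p_i*g_{i-1} + ... + p_i*...*p_0*c0
        carry, term = g[i], p[i]
        for j in range(i - 1, -1, -1):
            carry |= term & g[j]
            term &= p[j]
        carry |= term & c0
        c.append(carry)
    return c

def ripple_carries(a_bits, b_bits, c0):
    c = [c0]
    for a, b in zip(a_bits, b_bits):
        c.append((a & b) | (a & c[-1]) | (b & c[-1]))
    return c

# Exhaustive check over all 4-bit operand pairs and both carry-ins.
for x in range(16):
    for y in range(16):
        for c0 in (0, 1):
            a = [(x >> i) & 1 for i in range(4)]
            b = [(y >> i) & 1 for i in range(4)]
            assert lookahead_carries(a, b, c0) == ripple_carries(a, b, c0)
print("lookahead matches ripple for all inputs")
```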

3. When performing arithmetic addition and subtraction, overflow might occur. Fill
in the blanks in the following table of overflow conditions for addition and
subtraction.
Operation    Operand A    Operand B    Result indicating overflow
A + B        ≥ 0          ≥ 0          (a)
A + B        < 0          < 0          (b)
A – B        ≥ 0          < 0          (c)
A – B        < 0          ≥ 0          (d)

Prove that the overflow condition can be determined simply by checking whether
the CarryIn to the most significant bit of the result differs from the CarryOut
of the most significant bit of the result.

Answer:
(1)
Operation    Operand A    Operand B    Result indicating overflow
A + B        ≥ 0          ≥ 0          (a) < 0
A + B        < 0          < 0          (b) ≥ 0
A – B        ≥ 0          < 0          (c) < 0
A – B        < 0          ≥ 0          (d) ≥ 0
(2) Build a table that shows all possible combinations of the operand sign bits
and the CarryIn to the sign bit position, and derive the CarryOut, overflow, and
related information:

Sign  Sign  Carry  Carry  Sign of  Correct sign  Over-   CarryIn XOR  Notes
A     B     In     Out    result   of result     flow?   CarryOut
0     0     0      0      0        0             No      0
0     0     1      0      1        0             Yes     1            Carries differ
0     1     0      0      1        1             No      0            |A| < |B|
0     1     1      1      0        0             No      0            |A| > |B|
1     0     0      0      1        1             No      0            |A| > |B|
1     0     1      1      0        0             No      0            |A| < |B|
1     1     0      1      0        1             Yes     1            Carries differ
1     1     1      1      1        1             No      0

From this table, an XOR of the CarryIn and CarryOut of the sign bit serves to
detect overflow.
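The same conclusion can be verified exhaustively for a small word width; a
sketch over all 4-bit operand pairs:

```python
# Verify, for 4-bit two's-complement addition, that overflow is exactly
# (CarryIn to the sign bit) XOR (CarryOut of the sign bit).
def add4(a, b):
    carry, result = 0, 0
    carries = [0]                      # carries[i] = carry into bit i
    for i in range(4):
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        result |= (s & 1) << i
        carry = s >> 1
        carries.append(carry)
    cin_msb, cout_msb = carries[3], carries[4]
    return result, cin_msb ^ cout_msb

def to_signed(v):                      # interpret a 4-bit pattern as signed
    return v - 16 if v >= 8 else v

for a in range(16):
    for b in range(16):
        result, flagged = add4(a, b)
        true_sum = to_signed(a) + to_signed(b)
        real_overflow = not (-8 <= true_sum <= 7)
        assert bool(flagged) == real_overflow
print("CarryIn XOR CarryOut detects overflow in all 256 cases")
```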

4. Assume all memory addresses are translated to physical addresses before the
cache is accessed. In this case, the cache is physically indexed and physically
tagged. Also assume a TLB is used. (a) Under what circumstance can a memory
reference encounter a TLB miss, a page table hit, and a cache miss? Briefly
explain why. (b) To speed up cache accesses, a processor may index the cache
with virtual addresses. This is called a virtually addressed cache, and it uses tags
that are virtual addresses. However, a problem called aliasing may occur. Explain
what aliasing is and why. (c) In today's computer systems, virtual memory and
cache work together as a hierarchy. When the operating system decides to move a
page back to disk, the contents of that page may have been brought into the cache
already. What should the OS do with the contents that are in the cache?

Answer:
(a) The data/instruction is in memory but not in the cache, and the page table
holds the mapping but the TLB does not: the TLB misses, the page table
supplies the translation (a page table hit), and the physically addressed
cache then misses.
(b) Aliasing is a situation in which the same object is accessed by two different
addresses; with a virtually addressed cache it can occur when two virtual
addresses map to the same physical page, so the same data may reside in two
cache locations and an update through one alias leaves the other stale.
(c) If the contents in the cache are dirty, force them to be written back to
memory and invalidate them in the cache; after that, copy the page back to
disk. If not, simply invalidate them in the cache and copy the page back to
disk.

5. The following three instructions are executed using MIPS 5-stage pipeline.
1. lw $2, 20($1)
2. sub $4, $2, $5
3. or $4, $2, $6
Since there is one cycle delay between lw and sub, a hazard detection unit is
required. Furthermore, by the time the hazard is detected, sub and or may have
already been fetched into the pipeline. Therefore it is also required to turn sub
into a nop and delay the execution of sub and or by one cycle as shown below.
1. lw $2, 20($1)
2. nop
3. sub $4, $2, $5
4. or $4, $2, $6
(a) In which stage should the hazard detection unit be placed? Why? (b) How can
you turn sub into a nop in MIPS 5-stage pipeline? (c) How can you prevent sub
and or from making progress and force these two instructions to repeat in the next
clock cycle? (d) Explain why there is one cycle delay between lw and sub.

Answer:
(a) ID: Instruction Decode and register file read stage.
(b) Deassert (zero) all nine control signals written into the ID/EX pipeline
register in the ID stage, so the instruction performs no register write, memory
write, or branch action as it proceeds down the pipeline.
(c) Set both control signals PCWrite and IF/IDWrite to 0 to prevent the PC
register and IF/ID pipeline register from changing.

(d) As shown in the following diagram, after a 1-cycle stall between lw and sub,
the forwarding logic can handle the dependence and execution proceeds. (If
there were no forwarding, a 2-cycle delay would be needed.)
        CC1  CC2  CC3  CC4  CC5  CC6  CC7
lw      IF   ID   EX   MEM  WB
nop          IF   ID   EX   MEM  WB
sub               IF   ID   EX   MEM  WB

6. Answer the following questions briefly.


(a) Will addition "0010 + 1110" cause an overflow using the 4-bit two's
complement signed-integer form? (Simply answer yes or no).
(b) What would you get after performing an arithmetic right shift by one bit on
1100_two?
(c) If one wishes to increase the accuracy of the floating-point numbers that can
be represented, then he/she should increase the size of which part in the
floating-point format?
(d) Name one event other than branches or jumps by which the normal flow of
instruction execution can be changed, e.g., by switching to a routine in the
operating system.

Answer:
(a) NO
(b) 1110_two
(c) Fraction
(d) Arithmetic overflow
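Parts (a) and (b) can be double-checked with a short sketch in 4-bit two's
complement:

```python
# (a) 0010 + 1110 in 4-bit two's complement: 2 + (-2) = 0 -> no overflow.
def to_signed4(v):
    return v - 16 if v & 0b1000 else v

total = to_signed4(0b0010) + to_signed4(0b1110)
print(total, -8 <= total <= 7)          # 0 True

# (b) an arithmetic right shift replicates the sign bit.
v = 0b1100
asr = (v >> 1) | (v & 0b1000)           # 1100 -> 1110
print(format(asr, '04b'))               # 1110
```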

7. A MIPS instruction takes five stages in a pipelined CPU design: (1) IF:
instruction fetch, (2) ID: instruction decode/register fetch, (3) ALU: execution
or memory address calculation, (4) MEM: access an operand in data memory, and
(5) WB: write a result back into the register file. Label the appropriate stage
in which each of the following actions needs to be executed. (Note that A and B
are two source operands, while ALUOut is the output register of the ALU, PC is
the program counter, IR is the instruction register, MDR is the memory data
register, Memory[k] is the k-th word in memory, and Reg[k] is the k-th register
in the register file.)
(a) Reg[IR[20-16]] = MDR;
(b) ALUOut = PC + (sign-extend (IR[15-0]) << 2);
(c) Memory[ALUOut] = B;

Answer:
(a) WB
(b) ID

(c) MEM

95 年清大電機

1. (1) Can you come up with a MIPS instruction that behaves like a NOP? The
instruction is executed by the pipeline but does not change any state.
(2) In a MIPS computer a main program can use "jal procedure address" to make a
procedure call and the callee can use "jr $ra" to return to the main program.
What is saved in register $ra during this process?
(3) Name and explain the three principal components that can be combined to
yield runtime.

Answer:
(1) sll $zero, $zero, 0
(2) The address of the instruction following the jal (Return address)
(3) Runtime = instruction count × CPI (cycles per instruction) × clock cycle time
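A minimal sketch of the runtime equation, with made-up example numbers:

```python
# The three principal components of runtime, with illustrative values.
def runtime_seconds(instruction_count, cpi, clock_ns):
    cycles = instruction_count * cpi
    return cycles * clock_ns / 1e9      # ns -> s

# e.g. 10^9 instructions, CPI of 1.5, a 500 MHz clock (2 ns cycle time)
print(runtime_seconds(1_000_000_000, 1.5, 2))   # 3.0
```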

2. (1) Briefly explain the purpose of having a write buffer in the design of a
write-through cache.
(2) A large cache block tends to decrease the cache miss rate due to better
spatial locality. However, it has been observed that too large a cache block
actually increases the miss rate, especially in a very small cache. Why?

Answer:
(1) After writing the data into the write buffer, the processor can continue
execution without waiting for the slow memory update to complete, so CPU
performance is improved.
(2) The number of blocks that can be held in the cache will become small, and
there will be a great deal of competition for those blocks. As a result, a block
will be bumped out of the cache before many of its words are accessed.

3. (1) Dynamic branch prediction is often used in today's machine. Consider a loop
branch that branches nine times in a row, and then is not taken once. What is
the prediction accuracy for this branch, assuming a simple 1-bit prediction
scheme is used and the prediction bit for this branch remains in the prediction
buffer? Briefly explain your result.
(2) What is the prediction accuracy if a 2-bit prediction scheme is used? Again
briefly explain your result.

Answer:
(1) The steady-state prediction behavior will mispredict on the first and last loop
iterations. Mispredicting the last iteration is inevitable since the prediction bit
will say taken. The misprediction on the first iteration happens because the bit
is flipped on prior execution of the last iteration of the loop, since the branch
was not taken on that exiting iteration. Thus, the prediction accuracy for this
branch is 80% (two incorrect predictions and eight correct ones).

(2) The prediction accuracy with a 2-bit prediction scheme is 90%, since only the
last loop iteration will be mispredicted.
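Both accuracies can be reproduced by simulating the two predictors on the branch
pattern (nine taken outcomes, then one not taken, repeated); a sketch:

```python
# Simulate 1-bit and 2-bit saturating-counter predictors on the loop branch.
def accuracy_1bit(pattern, rounds):
    state, correct = 1, 0                    # start predicting "taken"
    for taken in pattern * rounds:
        correct += (state == taken)
        state = taken                        # 1-bit: remember last outcome
    return correct / (len(pattern) * rounds)

def accuracy_2bit(pattern, rounds):
    state, correct = 3, 0                    # 2-bit counter, 3 = strongly taken
    for taken in pattern * rounds:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / (len(pattern) * rounds)

pattern = [True] * 9 + [False]               # nine taken, one not taken
print(accuracy_1bit(pattern, 100))           # 0.801 (~80% in steady state)
print(accuracy_2bit(pattern, 100))           # 0.9
```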

4. Answer the following questions briefly:


(1) In a pipelined CPU design, what kind of problem may occur as it executes
instructions corresponding to an if-statement in a C program? Name one
possible scheme to get around this problem more or less.
(2) Consider the possible actions in the Instruction Decode stage of a pipelined
CPU. In addition to setting up the two input operands of ALU, what is the
other possible action? (Hint: consider the execution of a jump instruction)
(3) What is x if the maximum number of memory words you can use in a 32-bit
MIPS machine in a single program is expressed as 2^x? (Note: MIPS uses a
byte-addressing scheme.)

Answer:
(1) Control hazard.
Solution: Insert Nop instruction, delay branch, branch prediction
(2) Decode instruction, sign-extend 16 bits immediate constant, jump address
calculation, branch target calculation, register comparison, load-use data
hazard detection.
(3) A single program in a 32-bit MIPS machine can use 256 MB = 2^28 bytes =
2^26 words. So, x = 26.

5. Consider the following flow chart of a sequential multiplier. We assume that the
64-bit multiplicand register is initialized with the 32-bit original multiplicand in
the right half and 0 in the left half. The final result is to be placed in a product
register. Fill in the missing descriptions in blanks A and B.
(Flowchart: Start → test Multiplier[0]. If Multiplier[0] = 1, perform Blank A;
if Multiplier[0] = 0, skip it. Then: shift the Multiplicand register left by
1 bit → Blank B → 32nd repetition? If no (< 32 repetitions), loop back to the
test; if yes (32 repetitions), Done.)

Answer:
Blank A: add Multiplicand to product and place the result in the Product register
Blank B: shift the Multiplier register right 1 bit
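The flowchart's algorithm can be sketched directly, here for 32-bit unsigned
operands, with Blank A and Blank B marked in the comments:

```python
# Sequential shift-add multiplier with a 64-bit multiplicand register that
# starts with the 32-bit operand in its right half.
MASK64 = (1 << 64) - 1

def sequential_multiply(multiplicand, multiplier):
    mcand, product = multiplicand, 0
    for _ in range(32):                          # 32 repetitions
        if multiplier & 1:                       # test Multiplier[0]
            product = (product + mcand) & MASK64 # Blank A: add to product
        mcand = (mcand << 1) & MASK64            # shift multiplicand left
        multiplier >>= 1                         # Blank B: shift multiplier right
    return product

print(sequential_multiply(123456, 789))   # 97406784
```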

6. Schedule the following instruction segment into a superscalar pipeline for MIPS.
Assume that the pipeline can execute one ALU or branch instruction and one data
transfer instruction concurrently. For the best, the instruction segment can be
executed in four clock cycles. Fill in the instruction identifiers into the table. Note
that data dependency should be taken into account.
(Identifier) (Instruction)
ln-1 Loop: lw $t0, 0($s1)
ln-2 addu $t0, $t0, $s2
ln-3 sw $t0, 0($s1)
ln-4 addi $s1, $s1, −4
ln-5 bne $s1, $zero, Loop

Clock Cycle    ALU or branch instruction    Data transfer instruction
1
2
3
4

Answer:
Clock Cycle    ALU or branch instruction    Data transfer instruction
1                                           ln-1 (lw)
2              ln-4 (addi)
3              ln-2 (addu)
4              ln-5 (bne)                   ln-3 (sw)
(Since ln-4 now decrements $s1 before the store executes, ln-3 must be adjusted
to sw $t0, 4($s1) to reference the original word.)

7. Suppose a computer's address size is k bits (using byte addressing), the cache size
is S bytes, the block size is B bytes and the cache is A-way set-associative.
Assume that B is a power of two, so B = 2b. Figure out what the following
quantities are in terms of S, B, A, b and k:
(1) the number of sets in the cache
(2) the number of index bits in the address
(3) the number of bits needed to implement the cache

Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2^b bytes/block
Associativity: A blocks/set
(1) Number of sets in the cache = S / (A·B)
(2) Number of index bits = log2(S / (A·B)) = log2(S / A) − b
(3) Number of tag bits = k − (log2(S / A) − b) − b = k − log2(S / A)
    Number of bits needed to implement the cache
    = sets/cache × associativity × (data + tag + valid)
    = (S / (A·B)) × A × (8B + k − log2(S / A) + 1)
    = (S / B) × (8B + k − log2(S / A) + 1) bits

8. To compare the maximum bandwidth for a synchronous and an asynchronous bus,


assume that the synchronous bus has a clock cycle of 50 ns, and each bus
transmission takes 1 clock cycle. The asynchronous bus requires 40 ns per
handshake and the asynchronous handshaking protocol consists of seven steps to
read a word from memory and receive it in an I/O device as shown below. The
data portion of both buses is 32 bits wide. Find the bandwidth for each bus in
MB/sec when performing one-word reads from a 200-ns memory.

(Timing diagram: the seven-step handshake on the ReadReq, Data, Ack, and
DataRdy lines for reading one word from memory.)

Answer:
(1) The synchronous bus has 50-ns bus cycles. The steps and times required for
the synchronous bus are as follows:
1. Send the address to memory: 50 ns
2. Read the memory: 200 ns
3. Send the data to the device: 50 ns
Thus, the total time is 300 ns.
The maximum bus bandwidth = 4 bytes/300ns = 13.3 MB/second
(2) For the asynchronous bus, the memory receives the address at the end of step
1 and does not need to put the data on the bus until the beginning of step 5;
step 2, 3, and 4 can overlap with the memory access time. This leads to the
following timing:
Step 1: 40 ns
Steps 2, 3, 4: max(3 × 40 ns, 200 ns) = 200 ns
Steps 5, 6, 7: 3 × 40 ns = 120 ns
Thus, the total time is 360 ns.
The maximum bus bandwidth = 4 bytes / 360 ns = 11.1 MB/second
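Both bandwidth figures can be reproduced with a short sketch (a word is 4
bytes):

```python
# MB/s = bytes per transfer / total transfer time in ns, scaled by 1000.
def bandwidth_mb_per_s(bytes_per_transfer, total_time_ns):
    return bytes_per_transfer / total_time_ns * 1000

# Synchronous: 50 ns (address) + 200 ns (memory read) + 50 ns (data)
sync_bus = bandwidth_mb_per_s(4, 50 + 200 + 50)
# Asynchronous: 40 ns + max(3*40, 200) ns + 3*40 ns
async_bus = bandwidth_mb_per_s(4, 40 + max(3 * 40, 200) + 3 * 40)
print(round(sync_bus, 1), round(async_bus, 1))   # 13.3 11.1
```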

9. Bus arbitration is needed in deciding which bus master gets to use the bus next
in a computer system. There are a wide variety of schemes for bus arbitration;
these may involve special hardware or extremely sophisticated bus protocols. In
a bus arbitration scheme, a device (or the processor) wanting to use the bus
signals a bus request and is later granted the bus. After a grant, the device can
use the bus, later signaling to the arbiter that the bus is no longer required. The
arbiter can then grant the bus to another device. Most multiple-master buses
have a set of bus lines for performing bus requests and grants. A bus release line
is also needed if each device does not have its own request line. Sometimes the
signals used for bus arbitration have physically separate lines, while in other
systems the data lines of the bus are used for this function. Arbitration schemes
usually try to balance two factors in choosing which device to grant the bus,
namely, the priority and the fairness. In general, bus arbitration schemes can be
divided into four broad classes. What are those four classes? Briefly explain
those four classes of bus arbitration schemes.

Answer:
1. Daisy-chain arbitration: the bus grant line runs through the devices from
highest priority to lowest.
2. Centralized, parallel arbitration: multiple request lines are used, and a
centralized arbiter chooses from among the devices requesting the bus (e.g.,
the PCI backplane bus).
3. Distributed arbitration by self-selection: each device wanting the bus places
a code indicating its identity on the bus (e.g., the Apple Macintosh II NuBus).
4. Distributed arbitration by collision detection: each device independently
requests the bus; a collision is detected when multiple simultaneous requests
occur (e.g., Ethernet).

94 年清大電機

1. How many addressing modes are used in the following MIPS code? Please select
at least one instruction from the assembly code to explain different addressing
modes.
Addressing mode examples

search: sw $fp, 20($sp)
        sw $gp, 16($sp)
        move $fp, $sp
        sw $a0, 24($fp)
        sw $a1, 28($fp)
        sw $a2, 32($fp)
        sw $zero, 8($fp)
$L2:    lw $v0, 8($fp)
        lw $v1, 28($fp)
        slt $v0, $v0, $v1
        bne $v0, $zero, $L5
        j $L3
$L5:    lw $v0, 8($fp)
        move $v1, $v0
        sll $v0, $v1, 2
        lw $v1, 24($fp)
        addu $v0, $v0, $v1
        lw $v1, 0($v0)
        lw $v0, 32($fp)
        bne $v1, $v0, $L4
        lw $v1, 8($fp)
        move $v0, $v1
        j $L1
$L4:    lw $v0, 8($fp)
        addu $v1, $v0, 1
        sw $v1, 8($fp)
        j $L2
$L3:    li $v0, -1
        j $L1
$L1:    move $sp, $fp
        lw $fp, 20($sp)
        addu $sp, $sp, 24
        j $ra
        .end search

Answer:
Addressing Modes Example instruction
(1) Register addressing slt $v0, $v0, $v1
(2) Base or displacement addressing lw $v0, 32($fp)
(3) Immediate addressing li $v0, -1
(4) PC-relative addressing bne $v0, $zero, $L5
(5) Pseudodirect addressing j $L1

2. Answer the following two yes or no questions about the MIPS assembly language.
a. Is it true that the instruction "slt $s1, $s2, $s3" will set $s1 to 1 if $s2
is less than $s3?
b. Is it true that the so-called jump-and-link instruction, e.g., jal 2500, is
mainly to support the return action from a procedure call to its caller
function?

Answer:
a. Yes
b. No. (It jumps from the caller to the callee and simultaneously saves the
address of the following instruction in register $ra; the return itself uses
jr $ra.)

3. Consider a pipelined CPU design with the 5 stages being (1) instruction fetch, (2)
decoding, (3) execution, (4) memory access, and (5) register write back.
(a) List all instruction stages in which we need to read or write the register file.
(b) What is the minimum number of IO ports required for the register file if the
access of the register file takes one full clock cycle? (Note: an IO port can be
used for either a read or write operation.)
(c) What is the minimum number of IO ports required for the register file if the
access of the register file takes only half a clock cycle? (Note: we will be able
to perform two accesses to or from the register files within one clock cycle.)

Answer:
(a) decoding and register write back stages
(b) 3
(c) 2

4. Consider the following summary table of execution steps that need to be
performed by four major instruction classes: (1) arithmetic-logic (also called
R-type), (2) memory-reference, (3) branch, and (4) jump instructions.
(a) Complete the missing action in the entry marked "?" under the column for the
memory-reference instructions.
(b) What is the instruction class of the column marked ②?

Execution Step     R-type             Memory-Reference     ①                  ②
Instruction        IR ← Memory[PC]; PC ← PC + 4;
Fetch
Instruction        A ← Reg[IR[25-21]]; B ← Reg[IR[20-16]];
Decode             ALUOut ← PC + (sign-extend(IR[15-0]) << 2);
Execution          ALUOut ← A op B;   ALUOut ← A +         If (A == B) then   PC ← PC[31-28] ||
                                      sign-extend          PC ← ALUOut;       (IR[25-0] << 2);
                                      (IR[15-0]);
Memory Access                         Load: MDR ← (?);
                                      Store: omitted;
Register Write     Reg[IR[15-11]]     Load: Reg[IR[20-16]]
Back               ← ALUOut;          ← MDR;

Answer:
(a) MDR ← Memory[ALUOut]
(b) The jump instruction class

5. Answer the following questions:
a) Explain (please also draw a diagram) the following methods how can they
resolve the multiple simultaneously interrupt requests from the I/O Devices? (1)
Daisy chain, (2) Polling
b) Explain the following two modes, (1) cycle stealing and (2) block mode, for
DMA controller to transfer the data from I/O device to Memory. Which mode is
transparent (unknown) to CPU operation? Why?
c) Now suppose that the CPU executes a maximum of 10^6 instructions/sec. An
average instruction execution requires five machine cycles, three of which use
the memory bus; a memory read/write uses one machine cycle to transfer one
word. What is the DMA transfer rate (words/sec) for the above two DMA
controller modes?

Answer:
(a) (1) Daisy chain: as shown below, when several I/O devices raise interrupt
requests at the same time, the device closer to the CPU on the grant chain has
the higher interrupt priority.
(Diagram: the grant line is daisy-chained from Device 1, highest priority, to
Device N, lowest priority; the release and request lines are wired-OR back to
the bus arbiter.)
(2) Polling: as shown below, the CPU periodically checks the status register
of each I/O device to see which device needs service. The polling order
implicitly defines each device's interrupt priority.
(Diagram: the CPU polls Device 1, Device 2, ..., Device n in turn.)

(b) (1) Cycle stealing: the DMA controller steals memory cycles from the CPU,
transferring one or a few words at a time before returning control.
(2) Block mode: an entire block is transferred in a single continuous burst;
this is needed for magnetic-disk drives and similar devices, where the data
transfer cannot be stopped or slowed down without loss of data.
So, cycle stealing is the mode transparent to CPU operation.
(c) Cycle stealing: (5 − 3) × 10^6 = 2 × 10^6 words/second
DMA block transfer: 5 × 10^6 words/second

6. The interrupt breakpoint or the DMA breakpoint is the instant when the CPU
responds to an interrupt request (INTR) or a DMA request while the CPU is
executing an instruction (several micro-steps or machine cycles are required to
execute each instruction in one instruction cycle).
(1) Where does the interrupt breakpoint occur?
(2) Where does the DMA breakpoint occur?
(3) After receiving the INTA signal from the CPU, how does the I/O device
identify itself to the CPU?

Answer:
(1) Only the machine cycle at which an instruction finishes execution can be an
interrupt breakpoint.
(2) Every machine cycle can be a DMA breakpoint.
(3) An interrupting device identifies itself to the CPU by:
- sending the address of its interrupt-handling routine (vectored interrupt), or
- putting an identifier in a Cause register.
(Diagram: one instruction cycle consists of processor cycles to fetch the
instruction, decode the instruction, fetch the operand, execute the
instruction, store the result, and process an interrupt; a DMA breakpoint may
occur between any two processor cycles, while the interrupt breakpoint occurs
only at the end of the instruction.)

93 年清大電機

1. Consider the representations of the floating-point numbers.


(a) A number is often denoted as (-1)^S × F × 2^E. What are the English names
and meanings of F and E, respectively?
(b) In the IEEE 754 standard format, a number is denoted as
(-1)^S × (1.F) × 2^(E-Bias). For single and double precision numbers, what are
the values of Bias, respectively?

Answer:
(a) F: Fraction (or significand), the significant digits of the number.
E: Exponent, the power of two by which the number is scaled.
(b) Single precision bias: 127
Double precision bias: 1023

2. Explain the meaning of each of the following MIPS instructions using an


if-statement. Denote the program counter as PC when needed. Note that an
instruction has four bytes.
slt $s1, $s2, $s3
slti $s1, $s2, 100
bne $s1, $s2, 25

Answer:
(1) if $s2 < $s3 then $s1 = 1 else $s1 = 0
(2) if $s2 < 100 then $s1 = 1 else $s1 = 0
(3) if $s1 ≠ $s2 then goto (PC + 4) + 100

3. In a 5-stage pipelined computer, 20% of the instructions are assumed to be
branch instructions that could cause one-cycle pipeline stalls if not properly
handled.
(a) Ignoring all other hazards, what is the CPI of this computer by taking into
account this branch-related control hazard?
(b) If the probability of a branch instruction being taken is 30% on the average,
then what is the average CPI under the ―predict-not-taken‖ branch prediction
scheme?

Answer:
(a) CPI = 1 + 0.2 × 1 = 1.2
(b) CPI = 1 + 0.2 × 0.3 × 1 = 1.06

4. Consider a computer system with a cache of 4K blocks, a four-word block size, a
4-byte word size, and a 32-bit address.
(1) What are the total number of sets and the total number of tag bits for caches
that are ① direct-mapped, ② two-way set associative, ③ four-way set
associative, and ④ fully associative?
(2) Draw four diagrams of the 32-bit address for the above four types of caches
and indicate in each diagram which bit fields are used for the tag, index,
block offset, etc., respectively.
(3) What block number does byte address 1200 map to in the four types of caches,
respectively?

Answer:
(1)(2) 4K blocks → index = 12 bits; 4-word blocks → block offset = 2 bits;
4-byte words → byte offset = 2 bits
① Direct-mapped: total tag bits = (32 − 12 − 2 − 2) × 4K = 16 × 4K = 64K bits
   | Tag: 16 bits | Index: 12 bits | Block offset: 2 bits | Byte offset: 2 bits |
② Two-way set associative: total tag bits = (16 + 1) × 2K × 2 = 68K bits
   | Tag: 17 bits | Index: 11 bits | Block offset: 2 bits | Byte offset: 2 bits |
③ Four-way set associative: total tag bits = (17 + 1) × 1K × 4 = 72K bits
   | Tag: 18 bits | Index: 10 bits | Block offset: 2 bits | Byte offset: 2 bits |
④ Fully associative: total tag bits = (32 − 2 − 2) × 4K = 28 × 4K = 112K bits
   | Tag: 28 bits | Block offset: 2 bits | Byte offset: 2 bits |
(3) Block address = 1200 / 16 = 75
① Direct-mapped: block number 75
② Two-way set associative: set number 75 (75 mod 2K)
③ Four-way set associative: set number 75 (75 mod 1K)
④ Fully associative: set number 0 (there is only one set)
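The mappings in (3) can be checked with a short sketch (4K blocks of 16 bytes
each):

```python
# Map a byte address to its set number for a given cache organization.
def set_number(byte_addr, block_bytes, num_blocks, ways):
    block_addr = byte_addr // block_bytes
    num_sets = num_blocks // ways
    return block_addr % num_sets

addr, block_bytes, blocks = 1200, 16, 4096
print(set_number(addr, block_bytes, blocks, 1))      # direct-mapped: 75
print(set_number(addr, block_bytes, blocks, 2))      # two-way: 75
print(set_number(addr, block_bytes, blocks, 4))      # four-way: 75
print(set_number(addr, block_bytes, blocks, 4096))   # fully associative: 0
```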

5. Suppose we have a processor with a base CPI (clock-cycle per instruction) of 1.0,
assuming all reference hit in the primary cache, and a clock rate of 500 MHz.
Assume a main memory access time of 100 ns, including all the miss handling.
Suppose the miss rate per instruction at the primary cache is 5%.
(1) What is the miss penalty to main memory in clock cycles and the effective
CPI for this one-level caching processor?
(2) What will the effective CPI and how much faster will the machine be if we
add a secondary cache that has a 10-ns access time for either a hit or a miss
and the secondary cache is large enough to reduce the miss rate to main
memory to 2%?

Answer:
(1) CPU clock cycle time = 1 / 500 MHz = 2 ns
Miss penalty for main memory = 100 / 2 = 50 clock cycles
CPI = 1 + 50 × 0.05 = 3.5
(2) Miss penalty for second level cache = 10 / 2 = 5 clock cycles
CPI = 1 + 0.05 × 5 + 0.02 × 50 = 2.25
Speedup = 3.5 / 2.25 ≈ 1.56
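A short sketch of the one- and two-level cache CPI calculation:

```python
# One- and two-level cache CPI, with the problem's numbers.
CLOCK_NS = 2.0                       # 500 MHz clock
MAIN_MEM_PENALTY = 100 / CLOCK_NS    # 50 cycles

cpi_l1_only = 1.0 + 0.05 * MAIN_MEM_PENALTY
print(cpi_l1_only)                   # 3.5

L2_PENALTY = 10 / CLOCK_NS           # 5 cycles
cpi_with_l2 = 1.0 + 0.05 * L2_PENALTY + 0.02 * MAIN_MEM_PENALTY
print(cpi_with_l2)                   # 2.25
print(round(cpi_l1_only / cpi_with_l2, 2))   # 1.56
```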

6. (1) Explain the purpose of jump-and-link (jal) instruction.


(2) Explain why most of today‘s computer system use 2‘s complement instead of
signed-magnitude in their hardware implementations.
(3) Explain why geometric mean may be useful in comparing machine
performance.
(4) Power of 2 is normally used in the design of a computer. Is it possible to
construct a five-way set associatively cache? Why?
(5) Is MIPS (million instructions per second) an accurate measure for comparing
performance of different architecture? Why?

Answer:
(1) The jal instruction is used for procedure calls. It performs two actions at
once: it jumps to the starting address of the called procedure, and it saves
the value of the PC (the return address) in register $ra.
(2) Signed-magnitude has the following drawbacks: ① zero has two
representations, +0 and −0, which easily leads to run-time errors for careless
programmers; ② addition needs an extra step to determine whether the result
is positive or negative; ③ deciding where to place the sign bit is an added
burden on the designer.
(3) The geometric mean is independent of which data series we use for
normalization because it has the property
Geometric mean(Xi) / Geometric mean(Yi) = Geometric mean(Xi / Yi)
(4) Yes. Five-way only means that each set holds 5 blocks; as long as the total
number of sets and the number of bytes per block are powers of 2, a
set-associative cache still works.
(5) No, since there are three problems with using MIPS:
- MIPS specifies the instruction execution rate but does not take into
account the capabilities of the instructions
- MIPS varies between programs on the same computer
- MIPS can vary inversely with performance

7. How will you fill in a personal record such as ―Tom Lien‖ in the following table
using little-endian? Assume each row consists of 4 bytes.

Answer:
   m o T     (bytes 3–0 of the first word: ' ', 'm', 'o', 'T')
n  e i L     (bytes 3–0 of the second word: 'n', 'e', 'i', 'L')

8. The following figure is a 32-bit ALU constructed from 32 1-bit ALUs. CarryOut
of the less significant bit is connected to the CarryIn of the more significant bit.
Can you add a simple logic to detect if there is an overflow?
(Figure: 32 cascaded 1-bit ALUs, ALU0 through ALU31; each takes operand bits ai
and bi, the Operation control lines, and a CarryIn, and produces Resulti and a
CarryOut that feeds the next ALU's CarryIn.)

Answer:
Overflow = CarryIn[31] XOR CarryOut[31]

(Figure: the same 32-bit ALU with one added XOR gate whose inputs are the
CarryIn and CarryOut of ALU31 and whose output is the Overflow signal.)

92 年清大電機

1. Compute the value of the following floating-point number A based on the IEEE
standard. (Note: this floating-point number is composed of three fields, i.e., a
sign bit, 8 exponent bits, and 23 significand bits.)
A = (11000000101000000000000000000000)

Answer:
(−1)^1 × (1.01)_two × 2^(129−127) = −(101)_two = −5
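The decoding can be checked against Python's struct module, which interprets
the same 32-bit pattern as an IEEE 754 single-precision value:

```python
import struct

# Decode the bit pattern from the problem as an IEEE 754 single.
bits = 0b11000000101000000000000000000000      # = 0xC0A00000
value = struct.unpack('>f', bits.to_bytes(4, 'big'))[0]
print(value)   # -5.0
```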

2. Consider the addition process of the following two binary numbers A and B.
Determine the so-called carry generate and carry propagate, signals for each bit.
A = (00011010)
B = (11100101)

Answer:
A = 0 0 0 1 1 0 1 0
B = 1 1 1 0 0 1 0 1
pi 1 1 1 1 1 1 1 1
gi 0 0 0 0 0 0 0 0

3. Consider the process of adding up two floating-point numbers in a


microprocessor.
(1) Derive the proper sequence of the following three operations:
① Addition of significands
② Normalization
③ Alignment of exponents
(2) What operation is still needed in addition to the above three operations?

Answer:
(1) ③ → ① → ②
(2) Round the sum

4. One extension of the MIPS instruction set architecture has two new instructions
called movn (move if not zero) and movz (move if zero). For example, the
instruction
movn $8, $11, $4
copies the contents of register 11 into register 8, provided that the value in
register 4 is nonzero (otherwise it does nothing). The movz instruction is
similar, but copying takes place only if the register's value is zero. Show how
to use the new instructions to put whichever is larger, register 8's value or
register 11's value, into register 10. If the values are equal, copy either into
register 10. You

may use register 1 as an extra register for temporary use. Do not use any
conditional branches.

Answer:
slt $1, $8, $11
movn $10, $11, $1
movz $10, $8, $1
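The three-instruction sequence can be modeled with a small register-file
dictionary (the max_via_cmov helper is hypothetical, not part of the original
answer):

```python
# Model movn/movz over a register-file dict keyed by register number.
def movn(reg, rd, rs, rt):
    if reg[rt] != 0:
        reg[rd] = reg[rs]

def movz(reg, rd, rs, rt):
    if reg[rt] == 0:
        reg[rd] = reg[rs]

def max_via_cmov(a, b):
    reg = {8: a, 11: b, 10: 0, 1: 0}
    reg[1] = 1 if reg[8] < reg[11] else 0    # slt  $1, $8, $11
    movn(reg, 10, 11, 1)                     # movn $10, $11, $1
    movz(reg, 10, 8, 1)                      # movz $10, $8, $1
    return reg[10]

print(max_via_cmov(3, 9), max_via_cmov(9, 3), max_via_cmov(5, 5))  # 9 9 5
```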

5. Consider three machines with different cache configurations:


Cache 1:Direct-mapped with one-word blocks.
Cache 2:Direct-mapped with four-word blocks.
Cache 3:Two-way set associative with four-word blocks.
The following miss measurements have been made:
Cache 1:Instruction miss rate is 4%, data miss rate is 8%.
Cache 2:Instruction miss rate is 2%, data miss rate is 5%.
Cache 3:Instruction miss rate is 2%, data miss rate is 4%.
For these machines, one-half of the instructions contain a data reference. Assume
that the cache miss penalty is 6 + Block size in words. The CPI for this workload
was measured on a machine with cache 1 and found to be 2.0. (1) Determine
which machine spends the most cycles on cache misses. (2) The cycle times for the
three machines are 10 ns for the first and second machines and 12 ns for the third
machine. Determine which machine is the fastest and which is the slowest.

Answer:
(1) C1
Cache Miss penalty I cache miss D cache miss Total Miss
C1 6+1=7 4% × 7 = 0.28 8% × 7 = 0.56 0.28 + 0.56/2 = 0.56
C2 6 + 4 = 10 2% × 10 = 0.2 5% × 10 = 0.5 0.2 + 0.5/2 = 0.45
C3 6 + 4 = 10 2% × 10 = 0.2 4% × 10 = 0.4 0.2 + 0.4/2 = 0.4
(2) We need to calculate the base CPI that applies to all three processors. Since
we are given CPI = 2 for C1, CPIbase = CPI – CPImisses = 2 – 0.56 = 1.44
Execution Time for C1 = 2 × 10 ns × IC = 20 × 10-9 × IC
Execution Time for C2 = (1.44 + 0.45) × 10 ns × IC = 18.9 × 10-9 × IC
Execution Time for C3 = (1.44 + 0.4) ×12 ns × IC = 22.1 × 10-9 × IC
Therefore C2 is fastest and C3 is slowest.
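The miss-cycle and execution-time arithmetic above can be replayed in Python (the numbers come straight from the problem statement):

```python
# Per-instruction miss cycles and execution times from the problem data:
# miss penalty = 6 + block size (words), half the instructions make a
# data reference, base CPI = 2.0 - C1 miss cycles.
configs = {  # name: (block words, instr miss rate, data miss rate, cycle ns)
    "C1": (1, 0.04, 0.08, 10),
    "C2": (4, 0.02, 0.05, 10),
    "C3": (4, 0.02, 0.04, 12),
}
miss = {n: (6 + bw) * (im + dm / 2) for n, (bw, im, dm, _) in configs.items()}
base_cpi = 2.0 - miss["C1"]                      # 1.44
time = {n: (base_cpi + miss[n]) * cfg[3] for n, cfg in configs.items()}
print(miss)   # C1 spends the most miss cycles (0.56 per instruction)
print(time)   # C2 is fastest (18.9 ns/instr), C3 slowest (~22.1 ns/instr)
```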

6. A program repeatedly performs a three-step process: It reads in a 4-KB block of
data from disk, does some processing on the data, and then writes out the result as
another 4-KB block elsewhere on the disk. Each block is contiguous and
randomly located on a single track on the disk. The disk drive rotates at
7200RPM, has an average seek time of 8 ms, and has a transfer rate of 20 MB/sec.
The controller overhead is 2 ms. No other program is using the disk or the processor,

and there is no overlapping of disk operation with processing. The processing step
takes 20 million clock cycles, and the clock rate is 400 MHz. What is the overall
speed of the system in blocks processed per second?

Answer:
(Seek time + Rotational delay + Data transfer time + Controller time) × 2 +
processing time = (8 ms + 0.5/(7200/60) sec + 4 KB/(20 MB/sec) + 2 ms) × 2 +
(20×10^6)/(400×10^6) sec = (8 + 4.17 + 0.2 + 2) × 2 + 50 = 78.74 ms
Blocks processed/second = 1/78.74 ms ≈ 12.7
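A quick recomputation of the answer in Python (using 1 KB = 1000 bytes for the transfer time, as the solution does):

```python
# One iteration = two disk I/Os (read + write) plus the processing step.
seek = 8e-3                      # average seek time
rotation = 0.5 / (7200 / 60)     # half a revolution at 7200 RPM ~ 4.17 ms
transfer = 4e3 / 20e6            # 4 KB at 20 MB/s = 0.2 ms (1 KB = 1000 B here)
controller = 2e-3
processing = 20e6 / 400e6        # 20M cycles at 400 MHz = 50 ms
total = 2 * (seek + rotation + transfer + controller) + processing
print(round(total * 1e3, 2), round(1 / total, 2))  # ~78.73 ms, ~12.7 blocks/s
```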

7. Suppose register $s0 has the binary number
1111 1111 1111 1111 1111 1111 1111 1111two
and that register $s1 has the binary number
0000 0000 0000 0000 0000 0000 0000 0000two
What are the values of registers $t0 and $t1 after these two instructions?
slt $t0, $s0, $s1 # set on less than signed comparison
sltu $t1, $s0, $s1 # set on less than unsigned comparison

Answer:
(1) $t0 = 1
(2) $t1 = 0

8. Explain
(1) spatial locality
(2) write-back cache
(3) page table
(4) compulsory misses
(5) branch delay slot

Answer:
(1) The locality principle stating that if a data location is referenced, data
locations with nearby addresses will tend to be referenced soon.
(2) A scheme that handles writes by updating values only to the block in the
cache, then writing the modified block to the lower level of the hierarchy
when the block is replaced.
(3) The table containing the virtual to physical address translations in a virtual
memory system. The table, which is stored in memory, is typically indexed by
the virtual page number; each entry in the table contains the physical page
number for that virtual page if the page is currently in memory.
(4) Also called cold start miss. A cache miss caused by the first access to a block
that has never been in the cache.
(5) The slot directly after a delayed branch instruction, which in the MIPS
architecture is filled by an instruction that does not affect the branch.

9. Which change is more effective on a certain machine: speeding up 10-fold the
floating point square root operation only, which takes up 20% of execution time,
or speeding up 2-fold all other floating point operations, which take up 50% of
total execution time? Assume that the cost of accomplishing either change is the
same, and the two changes are mutually exclusive.

Answer:
Speedup1 = 1 / (1 − 0.2 + 0.2/10) = 1/0.82 ≈ 1.22
Speedup2 = 1 / (1 − 0.5 + 0.5/2) = 1/0.75 ≈ 1.33
So, speeding up 2-fold all other floating-point operations is more effective.
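Both speedups are instances of Amdahl's law; a minimal Python check:

```python
# Amdahl's law: speedup = 1 / ((1 - f) + f/s) when a fraction f of the
# execution time is sped up by a factor s.
def amdahl(f, s):
    return 1 / ((1 - f) + f / s)

print(round(amdahl(0.2, 10), 2))  # 1.22  (sqrt 10x faster, 20% of time)
print(round(amdahl(0.5, 2), 2))   # 1.33  (other FP 2x faster, 50% of time)
```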

96 年清大通訊

1. Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19,
56, 9, 11, 4, 43, 5, 6, 9, 17. Show the hits and misses and final cache contents for
a direct-mapped cache with four-word blocks and a total size of 16 words.

Answer:
Address (decimal)   Block address   Tag   Index   Hit/Miss
1 0 0 0 Miss
4 1 0 1 Miss
8 2 0 2 Miss
5 1 0 1 Hit
20 5 1 1 Miss
17 4 1 0 Miss
19 4 1 0 Hit
56 14 3 2 Miss
9 2 0 2 Miss
11 2 0 2 Hit
4 1 0 1 Miss
43 10 2 2 Miss
5 1 0 1 Hit
6 1 0 1 Hit
9 2 0 2 Miss
17 4 1 0 Hit

index contents
0 16, 17, 18, 19
1 4, 5, 6, 7
2 8, 9, 10, 11
3
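The trace can be replayed with a few lines of Python (direct-mapped, 4 frames of 4 words, so index = (addr div 4) mod 4 and tag = addr div 16):

```python
# Direct-mapped cache, 4-word blocks, 16 words total -> 4 block frames.
# For word address a: block = a // 4, index = block % 4, tag = block // 4.
refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
tags = [None] * 4                 # tag stored in each block frame
outcome = []
for a in refs:
    block = a // 4
    index, tag = block % 4, block // 4
    outcome.append("Hit" if tags[index] == tag else "Miss")
    tags[index] = tag             # on a miss the whole block is loaded
print(outcome.count("Hit"), outcome.count("Miss"))  # 6 hits, 10 misses
print(tags)  # [1, 0, 0, None]: words 16-19, 4-7, 8-11, and an empty frame
```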

2. The following program tries to copy words from the address in register $a0 to the
address in register $a1 and count the number of words copied in register $v0. The
program stops copying when it finds a word equal to 0. You do not have to
preserve the contents of registers $v1, $a0, and $a1. This terminating word should
be copied but not counted.
Loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # increment count words copied
sw $v1, 0($a0) # write to destination
addi $a0, $a0, 1 # advance pointer to the next source
addi $a1, $a1, 1 # advance pointer to the next destination
bne $v1, $zero, loop # loop if word copied is not zero
There are multiple bugs in this MIPS program. Please fix them and turn in a
bug-free version.

Answer:
addi $v0, $zero, -1 # start count at -1 so the terminating word is not counted
Loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # increment count of words copied
sw $v1, 0($a1) # write to destination (bug fix: was 0($a0))
addi $a0, $a0, 4 # advance source pointer by one word (bug fix: was 1)
addi $a1, $a1, 4 # advance destination pointer by one word (bug fix: was 1)
bne $v1, $zero, Loop # loop if word copied is not zero (bug fix: label case)

95 年清大通訊

1. Assume the critical path of a new computer implementation is memory access for
loads and stores. This causes the design to run at a clock rate of 500MHz instead
of the target clock rate of 750MHz. What is the solution with minimum
multi-cycle path to make the machine run at its targeted clock rate? Using the
table shown below, determine how much faster of the approach used on the
previous answer is compared with the 500 MHz machine with single-cycle
memory access. Assume all jumps and branches take the same number of cycles
and that the set instructions and arithmetic immediate instructions are
implemented as R-type instructions.
Instruction class  Frequency  Cycles per instruction on the 500 MHz machine
Loads 22% 5
Stores 11% 4
R-type 49% 4
Jump/branch 18% 3

Answer:
(1) If the memory access is divided into two clock cycles, the machine can run at
its targeted clock rate of 750 MHz.
(2) CPI for the single-cycle memory access machine =
5 × 0.22 + 4 × 0.11 + 4 × 0.49 + 3 × 0.18 = 4.04
For the multi-cycle memory access machine, the CPI for loads is 5 + 1 = 6 and for
stores it is 4 + 1 = 5.
CPI for the multi-cycle memory access machine = 6 × 0.22 + 5 × 0.11 + 4 × 0.49
+ 3 × 0.18 = 4.37
The average instruction execution time for the single-cycle machine =
4.04 × 2 ns = 8.08 ns
The average instruction execution time for the multi-cycle machine =
4.37 × 1.33 ns = 5.81 ns
The machine with multi-cycle memory access is 8.08/5.81 = 1.39 times faster
than the machine with single-cycle memory access.

2. A C procedure that swaps two locations in memory is shown below:


swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k + 1];
  v[k + 1] = temp;
}
(1) Find the hazard in the following code from the body of the swap procedure.
(2) Reorder the instructions to avoid as many pipelines stalls as possible.

# reg $2 has the address of v[k]
lw $15, 0($2) # reg $15 (temp) = v[k]
lw $16, 4($2) # reg $16 = v[k + 1]
sw $16, 0($2) # v[k] = reg $16
sw $15, 4($2) # v[k + 1] = reg $15 (temp)

Answer:
(1) there is a data hazard for register $16 between the second load word
instruction and the first store word instruction.
(2)
lw $15, 0($2)
lw $16, 4($2)
sw $15, 4($2)
sw $16, 0($2)

3. Bus A is a bus with separate 32-bit address and 32-bit data. Each transmission
takes one bus cycle. A read to the memory incurs a three-cycle latency, then
starting with the fourth cycle, the memory system can deliver up to 8 words at a
rate of 1 word every bus cycle. For a write, the first word is transmitted with the
address; after a three-cycle latency up to 7 additional words may be transmitted at
the rate of 1 word every bus cycle. Evaluate the bus assuming only 1 word
requests where 60% of the requests are reads and 40% are writes. Find the
maximum bandwidth that each bus and memory system can provide in words per
bus cycle.

Answer:
The latency for reading 8 words = 3 + 8 = 11
The maximum bandwidth for read = 8/11 (words/cycles)
The latency for writing 7 words = 1 + 3 + 7 = 11
The maximum bandwidth for write = 7/11 (words/cycles)
The maximum bandwidth that each bus and memory system can provide =
(8/11)  0.6 + (7/11)  0.4 = 0.69 (words/cycles)

94 年清大通訊

1. Here is a string of address references given as word addresses: 1, 4, 8, 5, 20, 17,
19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Show the hits and misses and final cache contents
for a two-way set associative cache with one-word blocks and a total size of 16
words. Assume LRU replacement.

Answer: length of offset = 0 bit, length of index = 3 bits


Address (decimal)   Address (binary)   Tag   Index   Hit/Miss   Block0 contents   Block1 contents
1 000001 0 1 Miss 1
4 000100 0 4 Miss 4
8 001000 1 0 Miss 8
5 000101 0 5 Miss 5
20 010100 2 4 Miss 4 20
17 010001 2 1 Miss 1 17
19 010011 2 3 Miss 19
56 111000 7 0 Miss 8 56
9 001001 1 1 Miss 9 17
11 001011 1 3 Miss 19 11
4 000100 0 4 Hit 4 20
43 101011 5 3 Miss 43 11
5 000101 0 5 Hit 5
6 000110 0 6 Miss 6
9 001001 1 1 Hit 9 17
17 010001 2 1 Hit 9 17

Set Block0 Block1


0 8 56
1 9 17
2
3 43 11
4 4 20
5 5
6 6
7
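The same trace can be replayed for this configuration (8 sets of 2 one-word blocks with LRU replacement) in Python:

```python
# 2-way set-associative cache, one-word blocks, 16 words -> 8 sets.
# For word address a: set = a % 8, tag = a // 8; LRU within each set.
refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
sets = [[] for _ in range(8)]     # tags per set, least recently used first
hits = 0
for a in refs:
    s, tag = sets[a % 8], a // 8
    if tag in s:
        hits += 1
        s.remove(tag)             # re-appended below as most recently used
    elif len(s) == 2:
        s.pop(0)                  # set full: evict the LRU tag
    s.append(tag)
print(hits, len(refs) - hits)     # 4 hits, 12 misses, as in the table
```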

2. Consider three machines with different cache configurations:
• Cache 1: Direct mapped with one-word blocks.
• Cache 2: Direct mapped with four-word blocks.
• Cache 3: 2-way set associative with four-word blocks.
The following miss rate measurements have been made:
• Cache 1: Instruction miss rate is 4%; data miss rate 8%.
• Cache 2: Instruction miss rate is 2%; data miss rate 5%.
• Cache 3: Instruction miss rate is 2%; data miss rate 4%.
For these machines, one-half of the instructions contain a data reference. Assume
that the cache miss penalty is 6 + Block size in words. The CPI for this workload
was measured on a machine with cache 1 and was found to be 2.0. Answer the
following two questions.
(1) Determine which machine spends the most cycles on cache misses.
(2) The cycle times for the three machines are 10 ns for the first and second
machines and 12 ns for the third machine. Determine which machine is the
fastest and which is the slowest.

Answer:
(1) C1
Cache Miss penalty I cache miss D cache miss Total Miss
C1 6+1=7 4% × 7 = 0.28 8% × 7 = 0.56 0.28 + 0.56/2 = 0.56
C2 6 + 4 = 10 2% × 10 = 0.2 5% × 10 = 0.5 0.2 + 0.5/2 = 0.45
C3 6 + 4 = 10 2% × 10 = 0.2 4% × 10 = 0.4 0.2 + 0.4/2 = 0.4
(2) We need to calculate the base CPI that applies to all three processors. Since
we are given CPI = 2 for C1, CPIbase = CPI – CPImisses = 2 – 0.56 = 1.44
Execution Time for C1 = 2 × 10 ns × IC = 20 × 10-9 × IC
Execution Time for C2 = (1.44 + 0.45) × 10 ns × IC = 18.9 × 10-9 × IC
Execution Time for C3 = (1.44 + 0.4) ×12 ns × IC = 22.1 × 10-9 × IC
Therefore C2 is the fastest and C3 is the slowest.

93 年清大通訊

1. How does DMA increase system concurrency? How does it complicate hardware
design?

Answer:
(1) DMA increases system concurrency by freeing the CPU to perform other
tasks while it handles data transfer to/from the disk.
(2) The hardware design of a system with DMA is complicated because a special
DMA controller must be integrated into the system so that DMA and normal
CPU operations can coexist.

2. There are six relative conditions between the values of two registers. In this
problem we consider two of them. Assuming that variable i corresponds to
register $19 and variable j to $20, show the MIPS code for the conditions
corresponding to the following two C codes:
(1) if (i == j) goto L1;
(2) if (i < j) goto L1;

Answer:
(1) beq $19, $20, L1
(2) slt $at, $19, $20
bne $at, $zero, L1

3. Consider a virtual memory system with the following properties:


(1) 40 bit virtual address
(2) 16 KB pages
(3) 36-bit physical address
Assume that the valid, protection, dirty, and use bits take a total of 4 bits and that
all the virtual pages are in use. Assume that disk addresses are not stored in the
page table. What is the total size of the page table for each process on this
machine?

Answer:
Each page is 16 KB  14 bits page offset.
The bits of virtual page number = 40 – 14 = 26 bits  226 entries in the page table
Each entry requires 36 – 14 = 22 bits to store the physical page number and an
additional 4 bits for the valid, protection, dirty, and use bits. We round the 26 bits
up to a full word per entry, so this gives us a total size of 226 × 32 bits or 256 MB.
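The size computation in Python form:

```python
# 40-bit virtual address, 16 KB pages, 36-bit physical address,
# 4 status bits per entry, each entry rounded up to a 32-bit word.
page_offset = 14                         # log2(16 KB)
entries = 2 ** (40 - page_offset)        # 2^26 virtual pages
raw_entry_bits = (36 - page_offset) + 4  # 22-bit PPN + 4 status bits = 26
size_bytes = entries * 32 // 8           # 26 bits rounded up to one word
print(size_bytes // 2**20)  # 256 (MB)
```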

4. Assume that a hard disk in a computer transfers data in one-word chunks and can
transfer at 2 MB/sec. Assume that no transfer can be missed. Assume that the
number of clock cycles for polling operation is 100 and that the processor
executes with a 50 MHz clock. Determine the fraction of CPU time consumed by
the hard disk assuming that you poll often enough so that no data is ever lost.

Answer:
We must poll at a rate equal to the data rate in one-word chunks, which is 500K
times per second (2 MB per second/4 bytes per transfer). Thus,
Cycles per second for polling = 500K × 100. Ignoring the discrepancy in bases,
Fraction of the processor consumed = (50 × 106) / (50 × 106) = 100%

92 年清大通訊

1. A program runs in 10 seconds on computer A, which has a 100 MHz clock. We
are trying to help a computer designer build a machine B that will run this
program in 6 seconds. The designer has determined that a substantial increase in
the clock rate is possible, but this increase will affect the rest of the CPU design,
causing machine B to require 1.2 times as many clock cycles as machine A for
this program. What clock rate should the designer aim for in machine B?

Answer:
Cycles for executing in computer A = 10  100 MHz = 109
Cycles needed for executing in computer B = 1.2  109
Suppose the clock rate for the computer B is R, then 1.2  109 / R = 6
 R = 200 MHz

2. Use Booth's algorithm to compute 2ten  −3ten.

Answer:
Iteration  Step                        Multiplicand  Product
0          Initial values              0010          0000 1101 0
1          10 → Prod = Prod − Mcand    0010          1110 1101 0
           Shift right product         0010          1111 0110 1
2          01 → Prod = Prod + Mcand    0010          0001 0110 1
           Shift right product         0010          0000 1011 0
3          10 → Prod = Prod − Mcand    0010          1110 1011 0
           Shift right product         0010          1111 0101 1
4          11 → No operation           0010          1111 0101 1
           Shift right product         0010          1111 1010 1
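The iteration table can be reproduced programmatically; a sketch of the 4-bit Booth algorithm in Python, with the 9-bit product register laid out as in the table:

```python
# 4-bit Booth multiplication of 2 x (-3). The 9-bit product register is
# [4-bit upper half | 4-bit multiplier | 1 extra bit], as in the table.
MASK9 = (1 << 9) - 1
mcand = 0b0010                      # +2
prod = 0b1101 << 1                  # multiplier -3 in the middle, extra bit 0
for _ in range(4):
    pair = prod & 0b11              # current and previous multiplier bits
    if pair == 0b10:                # 10 -> subtract multiplicand from upper half
        prod = (prod - (mcand << 5)) & MASK9
    elif pair == 0b01:              # 01 -> add multiplicand to upper half
        prod = (prod + (mcand << 5)) & MASK9
    prod = (prod >> 1) | (prod & (1 << 8))  # arithmetic right shift
result = prod >> 1                  # drop the extra bit: 8-bit product
print(f"{result:08b}")              # 11111010 = -6 in two's complement
```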

3. Given the bit pattern 1000 1111 1110 1111 1100 0000 0000 0000. What does it
represent, assuming that it is
(1) a two's complement integer?
(2) an unsigned integer?
(3) a single precision floating-point number?
(4) a MIPS instruction?

Answer:
(1) − (230 + 229 + 228 + 220 + 214)10
(2) (231 + 227 + 226 + 225 + 224 + 223 + 222 + 221 + 219 + 218 + 217 + 216 + 215 + 214)10
(3) − 1.1101111112  2−96
(4) lw $t7, C00016($ra)

4. Suppose you want to perform two sums: one is a sum of two scalar variables and
one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000. What
speedup do you get with 1000 processors?

Answer:
Suppose T is the time required to add two variables
Execution time for one processor = 1 + 1000 × 1000 = 1000001 (T)
Execution time for 1000 processors = 1 + (1000 × 1000)/1000 = 1001 (T)
So, speedup = 1000001 T / 1001 T ≈ 999

96 年清大資工

1. Finite State Machines (FSMs) can be divided into two types, Moore machines and
Mealy machines. Please use an example to demonstrate that a Mealy machine can
have glitches or spikes at its output.

Answer:
The output of the Mealy machine can change either when the state changes or
when the input changes. This may cause temporary false outputs to occur. These
temporary false outputs are referred to as glitches and spikes.
For example, in the following circuit, A is a signal directly from a primary input,
B is a signal from a state (flip-flop) output, and C is a circuit output.

(Circuit figure: C is a combinational function of A and B, e.g. C = A AND B, where B is the Q output of a D flip-flop.)

If signal A changes earlier than signal B, a glitch occurs at output C, as the
timing diagram for this Mealy machine shows.

(Timing diagram: A changes before B, so output C shows a short spurious pulse, the glitch, until B settles.)

2. Your company uses a benchmark C to evaluate the performance of a computer A
used in your company. But the computer A can only execute integer instructions,
and it uses a sequence of integer instructions to emulate a single floating-point
instruction. The computer A is rated at 200 MIPS on the benchmark C. Now,
your boss would like to attach a floating-point coprocessor B to the computer A
such that the floating-point instructions can be executed by the coprocessor for
performance improvement. Note that, however, the combination of computer A
and the coprocessor B is rated only at 60 MIPS on the same benchmark C. The
following symbols are used in this problem:
I: the number of integer instructions executed on the benchmark C.
F: the number of floating-point instructions executed on the benchmark C.

N:the number of integer instructions to emulate a floating-point instruction.
Y:time to execute the benchmark C on the computer A alone.
Z: time to execute the benchmark C on the combination of computer A and the
coprocessor B.
a. Write an equation for the MIPS rating of computer A using the symbols
above.
b. Given I = 5  106, F = 5  105, N = 30, find Y and Z.
c. Do you agree with your boss from the performance point of view? Please
state the reasons to justify your answer.

Answer:
(a) MIPS_A = (I + F × N) / (Y × 10^6)
(b) Y = (I + F × N) / (MIPS_A × 10^6) = (5 × 10^6 + 5 × 10^5 × 30) / (200 × 10^6)
= 0.1 sec = 100 ms
MIPS_(A+B) = (I + F) / (Z × 10^6), so
Z = (I + F) / (MIPS_(A+B) × 10^6) = (5 × 10^6 + 5 × 10^5) / (60 × 10^6) ≈ 91.67 ms
(c) Yes. Although the MIPS rating of the processor/coprocessor combination seems
lower than that of the processor alone, that is not the case. This is clearly seen
from the execution times, since it takes only 91.67 ms to execute the program with
the coprocessor present, as opposed to 100 ms without it.

3. Suppose that a computer's address size is 32 bits (using byte addressing), the
cache size is 32 Kbytes, the block size is 1-word, and the cache is 4-way set
associative. (a) What is the number of sets in the cache? (b) What is the total
number of bits needed to implement the cache? Please show your answer as the
exact total number of bits.

Answer:
(a) The number of blocks in cache = 32 Kbyte / 4 byte = 8K
The number of sets in the cache = 8K / 4 = 2K
(b) The length of index field = log2(2K) = 11 bits
The length of byte offset field = 2 bits
The length of tag field = 32 – 11 – 2 = 19
The size of the cache = 2K  (1 + 19 + 32)  4 = 416 Kbits

4. A virtual memory system often implements a TLB to speed up the
virtual-to-physical address translation. A TLB has the following characteristics.
Assume each TLB entry has a valid bit, a dirty bit, a tag, and the page number.
Determine the exact total number of bits to implement this TLB.
• It is direct-mapped
• It has 16 entries
• The page size is 4 Kbytes
• The virtual address space is 4 Gbytes
• The physical memory is 1 Gbytes

Answer:
The length of virtual page number = log2(4 Gbytes/4 K bytes) = 20 bits
The length of physical page number = log2(1 Gbytes/4 K bytes) = 18 bits
The index field = log216 = 4 bits
The tag field = 20 – 4 = 16 bits
The number of bits in each TLB entry = 2 + 16 + 18 = 36 (valid + dirty + tag + physical page number)
The size of the TLB = 36 × 16 = 576 bits

5. Suppose that we have a system with the following characteristics:


• A memory and bus system supporting block access of 4 words.
• A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer
taking 1 clock cycle, and 1 clock cycle required to send an address to
memory.
• 2 clock cycles needed between each bus transaction. (Assume the bus is idle
before an access).
• A memory access time of 4 words is 300 ns.
Find the sustained bandwidth for a read of 256 words. Provide your answer in
MB/sec.

Answer:
1. 1 clock cycle that is required to send the address to memory
2. 300ns / (5ns/cycle) = 60 clock cycles to read memory
3. 2 clock cycles to send the data from the memory
4. 2 idle clock cycles between this transfer and the next
This is a total of 65 cycles, and 256/4 = 64 transactions are needed, so the entire
transfer takes 65  64 = 4160 clock cycles. Thus the latency is 4160 cycles  5
ns/cycle = 20,800 ns. The bus bandwidth is (256  4) bytes  (1sec / 20,800ns) =
49.23 MB/sec

95 年清大資工

1. Design an array multiplier that multiplies two 3-bit integers in two's complement
format and produces one 6-bit integer, also in two's complement format.

Answer:
(Figure: 3-bit two's-complement array multiplier. The partial products ai·bj are accumulated with a row of full adders (FA), and the rows involving the sign bits a2 and b2 (a2b0, a2b1, a0b2, a1b2) are subtracted with two rows of full subtractors (FS), yielding the product bits c4 c3 c2 c1 c0.)
FA: full adder
FS: full subtractor

Note: if the problem instead asked for the product of two unsigned numbers, the answer changes as follows.
(Figure: unsigned 3×3 array multiplier. The partial-product rows a0b2 a0b1 a0b0, a1b2 a1b1 a1b0 shifted left by one position, and a2b2 a2b1 a2b0 shifted left by two positions are all added with two rows of full adders (FA), producing c5 c4 c3 c2 c1 c0.)

2. Design a synchronous sequential machine that has one input X(t) and one output
Y(t). Y(t) should be 1 if there have been more 1s than 0s in the input over the
past 3 time steps, and 0 otherwise. Below is a sample sequence:

t 0 1 2 3 4 5 6 7 8 9 10
X(t) 0 1 0 1 1 0 1 0 1 0 1
Y(t) - - - 0 1 1 1 1 0 1 0

Answer:
X(t-3) X(t-2) X(t-1) | Y(t)
  0      0      0    |  0
  0      0      1    |  0
  0      1      0    |  0
  0      1      1    |  1
  1      0      0    |  0
  1      0      1    |  1
  1      1      0    |  1
  1      1      1    |  1

(Implementation: feed X(t) into a 3-bit shift register holding X(t-1), X(t-2), X(t-3); its outputs drive the combinational majority logic below to produce Y(t).)

Y(t) = X(t-3)X(t-2) + X(t-2)X(t-1) + X(t-3)X(t-1)
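The machine is just a 3-bit shift register feeding the majority function of the last three inputs; replaying the sample sequence in Python confirms the table:

```python
# Majority of the last three inputs (sum >= 2 is equivalent to the
# sum-of-products expression X(t-3)X(t-2) + X(t-2)X(t-1) + X(t-3)X(t-1)).
X = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
Y = [int(X[t - 3] + X[t - 2] + X[t - 1] >= 2) for t in range(3, len(X))]
print(Y)  # [0, 1, 1, 1, 1, 0, 1, 0], matching the sample Y(t) row
```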

3. Design a vector-interrupt controller (VIC) that has four interrupt sources A, B, C,
and D with fixed priority A < B < C < D. Whenever any interrupt occurs, the
VIC should output the ID of the interrupting source with the highest priority. For
example, if (A, B, C, D) = (0, 1, 0, 1), then the VIC should set "Interrupt
Occurred" to 1 and "Source ID" to 11, indicating that D is the interrupt source for
the host to serve. On the other hand, if (A, B, C, D) = (0, 0, 0, 0), then the VIC sets
"Interrupt Occurred" to 0, indicating that no service is required.

A Interrupt Occurred
B Source ID
VIC
C
D

Answer:
Inputs Outputs
A B C D ID1 ID0 INT
0 0 0 0 X X 0
1 0 0 0 0 0 1
X 1 0 0 0 1 1
X X 1 0 1 0 1
X X X 1 1 1 1
(K-maps for ID1 and ID0 over inputs A, B, C, D omitted; with don't-cares for the no-interrupt case they give:)
ID1 = C + D
ID0 = D + BC′
INT = A + B + C + D
(Gate-level figure omitted: ID1, ID0, and INT are realized directly from the equations above with OR and AND gates.)
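The equations can be checked exhaustively against the priority specification (using the ID encoding A→00, B→01, C→10, D→11 from the truth table); a Python sketch:

```python
# Check ID1 = C + D, ID0 = D + B*C', INT = A + B + C + D against the
# priority spec (D highest, A lowest) for all 16 input combinations.
for n in range(16):
    a, b, c, d = (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1
    id1, id0, intr = c | d, d | (b & (1 - c)), a | b | c | d
    # Reference behavior: highest-priority pending source wins.
    expect = 3 if d else 2 if c else 1 if b else 0 if a else None
    assert intr == (expect is not None)
    assert expect is None or (id1 << 1) | id0 == expect
print("equations match the priority specification")
```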

4. Consider a loop branch that branches nine times in a row, then is not taken once.
Assume that we are using a dynamic branch prediction scheme.
(a) What is the prediction accuracy for this branch if a simple 1-bit prediction
scheme is used?
(b) What is the prediction accuracy for this branch if a 2-bit prediction scheme is
used?
(c) Please draw the finite state machine for a 2-bit prediction scheme.

Answer:
(a) 80%. The first and the last predictions are wrong, so the accuracy is
(8/10) × 100% = 80%.
(b) 90%. Only the last prediction is wrong, so the accuracy is (9/10) × 100% = 90%.
(c) (State diagram: the standard 2-bit saturating counter. The two "predict taken"
states and two "predict not taken" states form a chain; each taken branch moves one
state toward strongly taken, each not-taken branch moves one state toward strongly
not taken, so the prediction only flips after two consecutive mispredictions.)

5. What feature of a write-through cache makes it more desirable than a write-back


cache in a multiprocessor system (with a shared memory)? On the other hand,
what feature of a write-back makes it more desirable than a write-through cache
in the same system?

Answer:
(a) Write-through keeps shared memory consistent with the cache, and thereby
reduces the complexity of the cache coherence protocol.
(b) Write-back reduces bus traffic and thereby allows more processors on a single
bus.

6. Consider a fully associative cache and a direct mapped cache with the same cache
size.
(a) Explain which one has a lower cache miss rate and why?
(b) The majority of processor caches today are direct-mapped, two-way set
associative, or four-way set associative, but not fully associative. Why?

Answer:

(a) Fully associative, because full associativity eliminates the misses caused by
multiple memory locations competing for the same cache location (conflict misses).
(b) Because the cost of the extra comparators and the delay imposed by having to do
the compare for a fully associative cache are too high.

94 年清大資工

1. The terms big-endian and little-endian were originally found in Jonathan Swift's
book, Gulliver's Travels. Now all processors must be designated as either
big-endian or little-endian. For example, DEC Alpha RISC and Intel 80x86
processors are little-endian. Motorola 6800 microprocessors and Sun
SuperSPARC are big-endian.
(a) Briefly explain the differences between big-endian and little-endian.
(b) Please illustrate big-endian and little-endian by considering the number 4097
stored in a 4-byte integer.

Address Big-Endian representation Little-Endian representation


00
01
02
03

Answer:
(a) In a big-endian system, the most significant value in the sequence is stored at
the lowest storage address (i.e., first). In a little-endian system, the least
significant value in the sequence is stored first.
(b) 4097 = 0x00001001
Address Big-Endian representation Little-Endian representation
00 00hex 01hex
01 00hex 10hex
02 10hex 00hex
03 01hex 00hex
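Python's struct module makes the two byte orders easy to see (the '>' and '<' prefixes select big- and little-endian packing):

```python
import struct

# 4097 = 0x00001001 packed as a 4-byte integer; '>' selects big-endian
# byte order and '<' little-endian.
big = struct.pack(">i", 4097)
little = struct.pack("<i", 4097)
print(big.hex())     # 00001001: byte 00 at the lowest address
print(little.hex())  # 01100000: byte 01 at the lowest address
```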

2. A computer whose processes have 1024 pages in their address spaces keeps its
page tables in memory. The overhead required for reading a word from the page
table is 500 nsec. In order to reduce the overhead, the computer has a TLB
(Translation Look-aside Buffer), which holds 32 (virtual page, physical page
frame) pairs, and can do a look up in 100 nsec. What hit rate is needed to reduce
the mean overhead to 200 nsec?

Answer:
100 ns + (1 − H) × 500 ns = 200 ns;  H = 0.8

3. Assume the instruction "ADD R0, R1, R2, LSL#2" in one instruction set
architecture performs the operation R0 = R1 + R2 × 4. Could you compute R0 = 99 ×
R1 using two ADD instructions? Write down the code if your answer is YES;
otherwise, state the reasons.

Answer: YES
ADD R0, R1, R1, LSL#5 // R0 = R1 + 32×R1 = 33×R1
ADD R0, R0, R0, LSL#1 // R0 = 33×R1 + 66×R1 = 99×R1
Note: the answer is not unique.

4. Forwarding is a technique to eliminate the data hazards that occur among
pipelined instructions. However, not all data hazards can be handled by
forwarding. If an instruction following a LOAD instruction depends on the result
of the LOAD instruction, a data hazard occurs, and the pipeline is stalled for
one cycle.
(a) Assume the percentage of LOAD instructions in a program is 20%, and half the
time the instruction following a LOAD instruction needs the result of the
LOAD instruction. What is the performance degradation due to the data
hazard?
(b) Pipeline scheduling or instruction scheduling techniques could be used to
eliminate the data hazard mentioned in (a). What is the philosophy of pipeline
scheduling? Use an example to demonstrate that pipeline scheduling can
eliminate the data hazard.
(c) What is the possible overhead of pipeline scheduling?

Answer:
(a) A pipeline with no hazards has a CPI of 1, and each load-use data hazard stalls
the pipeline for one clock cycle. Since 20% of the instructions are loads and half
of them are followed by a dependent instruction, the new CPI = 1 + 0.2 × 0.5 × 1 =
1.1. The pipeline's performance therefore degrades by (1.1 − 1)/1 = 10%.
(b) Rather than just allowing the pipeline to stall, the compiler can schedule the
instructions to avoid the stalls by rearranging the code sequence to eliminate the
hazards. For example, both add instructions in the left-hand code below have a
load-use hazard; swapping the second lw with the first add (right-hand code)
eliminates both hazards.
lw $t2, 4($t0) lw $t2, 4($t0)
add $t3, $t1, $t2 lw $t4, 8($t0)
sub $t6, $t6, $t7 sub $t6, $t6, $t7
lw $t4, 8($t0) add $t3, $t1, $t2
add $t5, $t1, $t4 add $t5, $t1, $t4
and $t8, $t8, $t9 and $t8, $t8, $t9
(c) Pipeline scheduling increases the number of registers used and adds compiler
overhead.

5. State two reasons that MIPS is not an accurate measure for comparing
performance among computers.

Answer:

1. MIPS specifies the instruction execution rate but does not take into account
the capabilities of the instructions
2. MIPS varies between programs on the same computer
3. MIPS can vary inversely with performance

6. (a) Consider the following sequence of address references given as word
addresses:
22, 10, 26, 30, 23, 18, 10, 14, 30, 11, 15, 19
For a 2-way set associative cache with a block size of 8 bytes, a word size of 4
bytes, a data capacity of 64 bytes and the LRU replacement, label each
reference in the sequence as a hit or a miss. Assume that the cache is initially
empty.
(b) Determine the number of bits required in each entry of a TLB that has the
following characteristics:
- The TLB is direct-mapped
- The TLB has 32 entries
- The page size is 1024 bytes
- Virtual byte addresses are 32 bits wide
- Physical byte addresses are 31 bits wide
Note that you only need to consider the following items for each entry:
- The valid bit
- The tag
- The physical page number
Answer: (a)
Address (decimal)   Address (binary)   Tag   Index   Hit/Miss   Set   Block0 contents   Block1 contents
22 10110 10 11 Miss 3 22,23
10 01010 01 01 Miss 1 10,11
26 11010 11 01 Miss 1 10,11 26,27
30 11110 11 11 Miss 3 22,23 30,31
23 10111 10 11 Hit 3 22,23 30,31
18 10010 10 01 Miss 1 18,19 26,27
10 01010 01 01 Miss 1 18,19 10,11
14 01110 01 11 Miss 3 22,23 14,15
30 11110 11 11 Miss 3 30,31 14,15
11 01011 01 01 Hit 1 18,19 10,11
15 01111 01 11 Hit 3 30,31 14,15
19 10011 10 01 Hit 1 18,19 10,11
(b) The page size is 1024 bytes  page offset has 10 bits
Hence, the physical page number has 31 – 10 = 21 bits and
TLB has 32 entries  index has 5 bits, then tag size = 32 – 10 – 5 = 17
So, the number of bits in each entry = 1 + 17 + 21 = 39 bits

7. (1) The average memory access time (AMAT) is defined as
AMAT = time for a hit + miss rate × miss penalty
Consider the following two machines:
- Machine 1: 100 MHz, a hit time of 1 clock cycle, a miss rate of 5%, and a
miss penalty of 20 clock cycles
- Machine 2: 100 MHz, a hit time of 1.2 clock cycles, a miss rate of 3%, and
a miss penalty of 25 clock cycles
Determine which machine has smaller AMAT
(2) Assume that you are running a program which uses a lot of data and that 50%
of the data the program needs causes page faults and must be retrieved from a
disk array. If the program needs data at the rate of 600 Mbytes/second and each
disk in the disk array can supply data at the rate of 30 Mbytes/second, what is
the minimum number of disks required in the disk array? You do not need to
worry about disk errors.
(3) Assume that there are 10 pairs of processors and disk arrays placed all over a
network. Each processor needs data at the rate of 600 Mbytes/second from its
disk array across the network. If one-third of the total traffic crosses the
bisection of the network, what is the bisection bandwidth needed (in
Mbytes/second)?

Answer:
(1) AMAT for Machine 1 = 10 ns × (1 + 0.05 × 20) = 20 ns
AMAT for Machine 2 = 10 ns × (1.2 + 0.03 × 25) = 19.5 ns
Hence, Machine 2 has smaller AMAT
(2) (600 × 0.5) / 30 = 10 disks
(3) 10 × 600 × (1/3) = 2000 MB/second
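The three calculations in Python form:

```python
# (1) AMAT in ns at 100 MHz (10 ns per clock cycle).
amat1 = 10 * (1.0 + 0.05 * 20)   # machine 1: 20 ns
amat2 = 10 * (1.2 + 0.03 * 25)   # machine 2: 19.5 ns -> smaller AMAT
# (2) Only the faulting half of the 600 MB/s demand goes to the disks.
disks = 600 * 0.5 / 30           # 10 disks minimum
# (3) One-third of the total 10 x 600 MB/s traffic crosses the bisection.
bisection = 10 * 600 / 3         # 2000 MB/s
print(round(amat1, 1), round(amat2, 1), disks, bisection)
```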

93 年清大資工

1. Consider a 16-bit processor which includes a register file of 4 registers (R0-R3).
R3 is hardwired to act as the program counter. This processor has only one
instruction ―Rd = Rs + #immed‖, and is implemented using a 3-stage pipeline.
(1) Instruction Fetch (IF): Instruction Register (IR) = MEM[R3]; R3 += 2;
(2) Instruction Decoding (ID): Decode IR to determine Rd, Rs, and #immed.
(3) Execution (EX): Execute Rd = Rs + #immed.
The operation R3 += 2 of the IF-stage and the write-back of Rd in the EX-stage
occur at the very end of each clock cycle (CC). If there is a conflict (both EX: Rd
and IF: R3 are writing to the register R3), the write-back R3 overrides the
operation R3 += 2 in the IF-stage. Consider executing the following instruction
sequence and its timing chart of the pipeline operation
Instruction Address  Instruction        CC:  1   2   3   4   5   6   7
0x0100               R0 = R3 + #0x001        IF  ID  EX
0x0102               R3 = R3 + #0x010            IF  ID  EX
0x0104               R2 = R3 + #0x100                IF  ID  EX
0x0106               R2 = R0 + #0x200                    IF  ID  EX
0x????               R1 = R0 + #0x300                        IF  ID  EX
(a) Right after CC = 1, what is the hexadecimal value stored in the register R3?
(b) Right after CC = 3, what is the hexadecimal value stored in the register R0?
(c) Right after CC = 4, what is the hexadecimal value stored in the register R3?
(d) Right after CC = 6, what is the hexadecimal value stored in the register R2?
(e) What is the hexadecimal address of last instruction in the table?

Answer:
(a) 0x0102  (b) 0x0103  (c) 0x0114  (d) 0x0303  (e) 0x0114
(Note: besides decoding the instruction, the ID stage also performs the register
fetch, so the register value read in ID is the value at the end of the clock
cycle preceding that ID cycle. For example, the first instruction performs ID in
clock cycle 2, so the R3 it reads is the R3 of clock cycle 1, not clock cycle 2.)
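The whole trace can be reproduced with a small cycle-by-cycle simulation of the semantics stated in the problem (a sketch; the dictionary-based register and program representation is an assumption made for illustration):

```python
# One-instruction, 3-stage pipeline: IF fetches MEM[R3] and does R3 += 2;
# ID reads Rs using the value from the end of the previous cycle;
# EX writes Rd at end of cycle, overriding IF's R3 += 2 on a conflict.
prog = {
    0x0100: ('R0', 'R3', 0x001),
    0x0102: ('R3', 'R3', 0x010),
    0x0104: ('R2', 'R3', 0x100),
    0x0106: ('R2', 'R0', 0x200),
    0x0114: ('R1', 'R0', 0x300),
}
regs = {'R0': 0, 'R1': 0, 'R2': 0, 'R3': 0x0100}
if_ir = id_latch = None                   # IF/ID and ID/EX latches
trace = {}

for cc in range(1, 8):
    ex = id_latch                         # instruction entering EX this cycle
    # ID: decode last cycle's IR; the register read sees last cycle's final value
    if if_ir is not None:
        rd, rs, imm = if_ir
        id_latch = (rd, regs[rs] + imm)
    else:
        id_latch = None
    if_ir = prog.get(regs['R3'])          # IF: IR = MEM[R3]
    regs['R3'] += 2                       # end of cycle: R3 += 2 ...
    if ex is not None:
        wr, val = ex
        regs[wr] = val                    # ... but the EX write-back overrides it
    trace[cc] = dict(regs)
```

Reading `trace` at cycles 1, 3, 4, and 6 reproduces answers (a) through (d), and the IF at cycle 5 fetches from 0x0114, giving (e).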

2. Suppose you are given an instruction set architecture (ISA) which includes only
two instruction formats
SUB Rd, Rs, #immed /* Rd = Rs - #immed */
ADD Rd, Rs, Rt /* Rd = Rs + Rt */
where Rd is the 8-bit destination register, Rs and Rt are the 8-bit source
registers, and the immediate value can be an integer between -4 and 3. How
would you translate each of the following pseudo-instructions into one or
multiple real ISA instructions?
(1) MOV Rd, Rs /*Rd = Rs*/
(2) INC Rd /*Rd++ */
(3) MOV Rd, Rs, lsl 1 /* Rd = Rs << 1; i.e., left shift */
(4) CLEAR Rd /* Rd = 0 */

Answer:
(1) SUB Rd, Rs, #0
(2) SUB Rd, Rd, #-1
(3) ADD Rd, Rs, Rs
(4) ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
(Note: doubling Rd eight times shifts it left by 8 bits, which clears an 8-bit register to 0.)
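The note above can be verified directly: doubling an 8-bit register eight times shifts its value left by 8 bits, leaving 0 (a minimal check; `rd` is an illustrative name):

```python
rd = 123                                 # any 8-bit starting value
for _ in range(8):
    rd = (rd + rd) & 0xFF                # ADD Rd, Rd, Rd on an 8-bit register
# rd is now 0 regardless of the starting value
```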

3. Suppose Booth‘s algorithm is used as our approach to multiplying two 8-bit


unsigned integer numbers. How many additions and subtractions are needed to
multiply 123 by 123?

Answer: Multiplier: 123ten = 0111 1011two
Appending an implicit 0 to the right of the LSB and scanning each bit pair
(current bit, bit to its right):

    0 1 1 1 1 0 1 1 (0)
    +       - +   -

A subtraction is issued at the start of each run of 1s (bit pair 10) and an
addition at the end of each run (bit pair 01). The multiplier contains two runs
of 1s, so: additions: 2, subtractions: 2
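The count can be confirmed by recoding the multiplier exactly as Booth's algorithm does, scanning bit pairs from the LSB with an implicit 0 on the right (a sketch; the function name is illustrative):

```python
def booth_ops(multiplier, bits=8):
    """Count (additions, subtractions) that Booth recoding issues."""
    adds = subs = prev = 0
    for i in range(bits):
        cur = (multiplier >> i) & 1
        if (cur, prev) == (1, 0):
            subs += 1                     # start of a run of 1s: subtract multiplicand
        elif (cur, prev) == (0, 1):
            adds += 1                     # end of a run of 1s: add multiplicand
        prev = cur
    return adds, subs
```

For an alternating pattern such as 0101 0101 the same function reports 4 additions and 4 subtractions, illustrating the case where Booth's algorithm is no better than the conventional method.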

4. In a computer memory hierarchy, determine which of the following five
combinations of events (for locating a page of memory) in the cache, TLB, page
table and main memory are possible to occur. Answer Yes or No to each of (1), (2),
(3), (4) and (5).
          TLB    Page table    Cache    Main memory
(1)       Hit    Hit           Hit      Miss
(2)       Hit    Hit           Miss     Miss
(3)       Miss   Hit           Hit      Miss
(4)       Miss   Miss          Hit      Hit
(5)       Hit    Miss          Hit      Hit

Answer: (1) No (2) No (3) No (4) No (5) No. All five combinations are
impossible: a cache hit implies the block is present in main memory, and a TLB
hit implies a page-table hit with the page resident in main memory.

5. For a CPU to effectively handle service requests from peripheral devices,


vectored interrupt is a popular mechanism. To implement the vectored interrupt
function, a combinational circuit called a priority encoder is commonly used.
(1) What is the operation principle of vectored interrupt?
(2) What is a priority encoder? How is it different from an ordinary encoder?

Answer:
(1) An interrupt vector is the memory address of an interrupt handler, or an index
into an array called an interrupt vector table. Interrupt vector tables contain
the memory addresses of interrupt handlers. When an interrupt is generated,
the processor saves its execution state, and begins execution of the interrupt
handler at the interrupt vector.

[Figure: a peripheral signals the interrupt controller, which asserts the
interrupt line and supplies the interrupt vector to the processor]
(2) A priority encoder encodes only the highest-order active input, even if
multiple inputs are activated.
The ordinary encoder has the limitation that only one input can be active at
any given time. If two inputs are active simultaneously, the output produces
an undefined combination.

92 年清大資工

1. (1) Give the flow diagram of the procedures for multiplying two binary
floating-point numbers.
(2) Multiply the two decimal numbers 0.75ten and -0.375ten by using the steps from
your answer in (1). Show the step-by-step intermediate results in your answer.

Answer:
(1) The flow: (i) add the biased exponents (subtracting the bias once);
(ii) multiply the significands; (iii) normalize the product and check for
exponent overflow or underflow; (iv) round the significand; (v) set the sign
of the product.
(2) In binary, the task is 1.100two × 2^-1 times -1.100two × 2^-2

Step 1: Adding the exponents: (-1 + 127) + (-2 + 127) - 127 = 124, i.e., an
exponent of -3

Step 2: Multiplying the significands: 1.100two × 1.100two = 10.010000two, so
the product is 10.010000two × 2^-3; keeping it to 4 bits gives 10.01two × 2^-3

Step 3: Normalizing the product: 1.001two × 2^-2; since 127 ≥ -2 ≥ -126, there
is no overflow or underflow

Step 4: Rounding the product makes no change: 1.001two × 2^-2

Step 5: Setting the sign of the product negative: -1.001two × 2^-2

Converting to decimal:
-1.001two × 2^-2 = -0.01001two = -9/32ten = -0.28125ten
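The arithmetic in the five steps can be double-checked numerically; every quantity involved is an exact binary fraction, so ordinary floating point reproduces it exactly (variable names are illustrative):

```python
significand = 0b1100 * 0b1100            # 1.100_2 x 1.100_2, 3 fraction bits each
exponent = (-1) + (-2)                   # biased: (126 + 125) - 127 = 124, i.e. -3
value = (significand / (1 << 6)) * 2.0 ** exponent   # product has 6 fraction bits
normalized = -(0b1001 / (1 << 3)) * 2.0 ** -2        # -1.001_2 x 2^-2
```

Both `value` (with the sign applied) and `normalized` equal 0.75 × (-0.375) = -0.28125 = -9/32.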

2. (a) What are the five steps required for the normal MIPS instructions? Briefly
describe each step in one sentence.
(b) Consider the following two contiguous MIPS instructions.
add $s0, $t0, $t1
sub $t2, $s0, $t3
What solution can be used to resolve the data hazard problem in the two
instructions? Give a graphical instruction-pipeline representation of your
solution.

Answer:
(a) 1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Memory Read Completion
(b) Forwarding: the result of add is passed directly from the EX/MEM pipeline
register to the ALU input of sub, so no stall is needed:

    add $s0, $t0, $t1    IF  ID  EX  MEM  WB
    sub $t2, $s0, $t3        IF  ID  EX  MEM  WB
                       (forward add's EX/MEM result to sub's EX stage)

3. Suppose we have a processor with a base CPI of 1.0, assuming all references
hit in the primary cache, and a clock rate of 800 MHz. Assume a main memory
access time of 125 ns, including all the miss handling. Suppose the miss rate
per instruction at the primary cache is 4%. What is the total CPI for this
machine with one level of caching? Now we add a secondary cache that has a
20 ns access time for either a hit or a miss, and the secondary cache is large
enough to reduce the miss rate to main memory to 2%. What is the total CPI for
this machine with a two-level cache?

Answer:
(1) CPU clock cycle time = 1 / 800 MHz = 1.25 ns
Miss penalty for main memory = 125 / 1.25 = 100 clock cycles
CPI for machine with one-level cache = 1 + 100 × 0.04 = 5
(2) Miss penalty for second level cache = 20 / 1.25 = 16 clock cycles
CPI for machine with two-level cache = 1 + 0.04 × 16 + 0.02 × 100 = 3.64
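The same CPI computations, written out (a sketch; variable names are illustrative):

```python
cycle_ns = 1000 / 800                    # 800 MHz -> 1.25 ns per cycle
main_penalty = round(125 / cycle_ns)     # 100 cycles to main memory
l2_penalty = round(20 / cycle_ns)        # 16 cycles to the secondary cache

cpi_one_level = 1.0 + 0.04 * main_penalty                       # 5.0
cpi_two_level = 1.0 + 0.04 * l2_penalty + 0.02 * main_penalty   # 3.64
```

Note the structure of the two-level formula: every primary miss (4%) pays the L2 access, and only the references that also miss in L2 (2% of all instructions) additionally pay the main-memory penalty.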

4. (1) What is the ideal performance improvement for an n-stage pipeline machine?
(2) Give two reasons, other than data hazards, why a pipelined machine cannot
achieve the ideal performance.
(3) What are the methods to remove the data hazard?
(4) Describe what is a carry save adder tree.

Answer:
(1) n times faster than the machine without pipelining
(2) (a) The stages may be imperfectly balanced
(b) The delay due to pipeline registers
(c) Control hazard.
(d) Time to "fill" and "drain" the pipeline
(3) (a) Insert nop instruction by compiler.
(b) Reorder code sequence by compiler.
(c) Forwarding by hardware.
(4) It is a tree of carry-save adders arranged to add the arguments in parallel.
Each carry-save adder in each level adds three operands and produces two
results. Carry save adder tree usually used for the addition of partial products
of multiplication.

5. RAID (redundant arrays of inexpensive disks) have been widely used to speed up
the disk access time. Several levels of RAID are supported. Please make the right
binding between the following RAID levels and explanations.
(1) RAID-0 (A) block-interleaved parity
(2) RAID-1 (B) non-redundant striping
(3) RAID-4 (C) mirrored disks
(4) RAID-5 (D) block-interleaved distributed parity

Answer:
RAID-0 RAID-1 RAID-4 RAID-5
(B) (C) (A) (D)
non-redundant mirrored disks block-interleaved block-interleaved
striping parity distributed parity

6. DSP processors are increasingly employed in embedded systems for supporting


audio and video applications. Explain the key features of DSP processors
(different) from conventional general-purpose processors.

Answer:
The essential difference between a DSP and a microprocessor is that a DSP
processor has features designed to support high-performance, repetitive,
numerically intensive tasks. In contrast, general-purpose processors are not

specialized for a specific kind of applications.
Features that accelerate performance in DSP applications include:
1. Single-cycle multiply-accumulate capability. High-performance DSPs often
have two multipliers that enable two multiply-accumulate operations per
instruction cycle.
2. Specialized addressing modes. DSPs generally feature multiple-access
memory architectures that enable DSPs to complete several accesses to
memory in a single instruction cycle.
3. Specialized execution control. Usually, DSP processors provide a loop
instruction that allows tight loops to be repeated without spending any
instruction cycles for updating and testing the loop counter or for jumping
back to the top of the loop.
4. DSP processors are known for their irregular instruction sets, which generally
allow several operations to be encoded in a single instruction.

96 年交大資聯

1. (Choice)
(1) Which is (are) correct?
a. Suppose there was a 16-bit IEEE 754-like floating-point format with 5
exponent bits; ±1.0000 0000 00 × 2^-15 to ±1.1111 1111 11 × 2^14, ±0, ±∞, NaN
is the likely range of numbers it could represent.
b. For the 32-bit IEEE 754 floating-point standard, the smallest positive
normalized number is: 1.0000 0000 0000 0000 0000 000 × 2^-125.
c. For the 32-bit IEEE 754 floating-point standard, the smallest denormalized
number is: 0.0000 0000 0000 0000 0000 001 × 2^-126.

Answer: c
Note a: the correct range is ±1.0000 0000 00 × 2^-14 to ±1.1111 1111 11 × 2^15
(plus ±0, ±∞, NaN), since a 5-bit biased exponent ranges from -14 to 15.
Note b: the smallest positive normalized number is
1.0000 0000 0000 0000 0000 000 × 2^-126.

(2) Some programming languages allow two's complement integer arithmetic on


variables declared byte and half word, i.e., 16 bits. What MIPS instructions would
be used?
a. Load with lbu, lhu; arithmetic with add, sub, mult, div; then storing using sb,
sh.
b. Load with lb, lh, arithmetic with add, sub, mult, div; then storing using sb, sh.
c. Loads with lb, lh; arithmetic with add, sub, mult, div; using and to mask result
to 8 or 16 bits after operation; then store using sb, sh.

Answer: b

(3) Carry look-ahead adder can diminish the carry delay which dominates the delay
of ripple carry adder. Generate (gi) and propagate (pi) functions are two main
operations of carry look-ahead adder. Assume a and b are two operands and ci+1
is the carry out of level i and carry in of level i + 1, which is (are) correct?
a. gi = ai · bi
b. pi = (ai + bi) · ci
c. If gi equals 1, we can say the carry out of level i is 1.
d. Carry look-ahead adders can be extended to a multi-level style. The first
group generate of a 3-bit group can then be defined as
G0 = g2 + (p2 · g1) + (p2 · p1 · g0)

Answer: a, c, d
Note b: pi = ai + bi
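A 3-bit carry-lookahead slice using gi = ai·bi and pi = ai + bi, including the group generate G0 from option d, can be checked exhaustively against ordinary addition (a sketch; the function name is illustrative):

```python
def cla_add3(a, b, c0=0):
    """3-bit carry-lookahead addition; returns the 4-bit sum including carry-out."""
    g = [(a >> i) & (b >> i) & 1 for i in range(3)]       # generate: ai AND bi
    p = [((a | b) >> i) & 1 for i in range(3)]            # propagate: ai OR bi
    c = [c0,
         g[0] | (p[0] & c0),
         g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)]
    G0 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])      # group generate (option d)
    P0 = p[2] & p[1] & p[0]                               # group propagate
    cout = G0 | (P0 & c0)
    s = sum((((a >> i) ^ (b >> i) ^ c[i]) & 1) << i for i in range(3))
    return s | (cout << 3)
```

All carries are computed directly from the g/p terms and c0, with no ripple; exhaustive comparison with `a + b + c0` over all 3-bit inputs confirms the equations.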

[Figure omitted: two multiplication hardware structures, labeled Structure a
and Structure b]
(4) The above figure shows two multiplication structures. Which is correct?
a. The shift operation in the multiplicand in structure a is shift-right.
b. The shift operation in the multiplier in the structure a is shift-right.
c. The multiplier is stored in the right part of the product register in structure b.
d. In structure b, one control signal for shifting multiplicand register is missed.

Answer: b
Note a: the multiplicand in structure a is shifted left, not right.
Note c: the multiplier initially occupies the right part of the product
register in structure b, but it is shifted out as the multiplication proceeds.

(5) About the 32-bit MIPS instructions, which description is correct?


a. MIPS has 32 registers inside CPU because it is a 32-bit CPU.
b. add instruction can not directly store the addition result to memory.
c. Since memory structure is byte-addressing, the address offset in beq
instruction is referred to as byte.
d. In MIPS, "branch-if-less-than" is realized using slt and beq/bne, since its
design principle is two faster instructions are more useful than one slow and
complicated instruction.

Answer: b
Note a: being a 32-bit CPU means the registers are 32 bits wide, not that
there are 32 of them.
Note c: the address offset in the beq instruction is in units of words.
Note d: the design principle is "smaller is faster" (keeping the instruction
set small).

2. (a) What is procedure frame? Also, stack pointer and frame pointer are used to
maintain procedure frame. Why does procedure frame require two pointers?
(b) Procedure has to spill registers to memory (save and then restore). Caller must
take care of $ax series and $tx series and callee must take care of $ra and $sx
series. Following codes require correction for spilling registers. Correct the
errors and state your reasons.

fact:
    addi $sp, $sp, -4
    sw   $ra, 0($sp)
    slti $t0, $a0, 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1
    addi $sp, $sp, 4
    jr   $ra
L1:
    addi $a0, $a0, -1
    jal  fact
    lw   $ra, 0($sp)
    addi $sp, $sp, 4
    mul  $v0, $a0, $v0
    jr   $ra

Answer:
(a) A procedure frame is the segment of stack containing a procedure‘s saved
registers and local variables.
A frame pointer is used to point to the location of the saved registers and local
variables for a given procedure. A stack pointer might change during the
procedure, and so references to local variable in memory might have different
offsets depending on where they are in the procedure, making the procedure
harder to understand. Alternatively, a frame pointer offers a stable base register
within a procedure for local memory references.
(b)
fact:
    addi $sp, $sp, -8
    sw   $ra, 4($sp)
    sw   $a0, 0($sp)
    slti $t0, $a0, 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1
    addi $sp, $sp, 8
    jr   $ra
L1:
    addi $a0, $a0, -1
    jal  fact
    lw   $a0, 0($sp)
    lw   $ra, 4($sp)
    addi $sp, $sp, 8
    mul  $v0, $a0, $v0
    jr   $ra

Since procedure fact calls itself recursively and the argument in register $a0
is still needed after the call returns, the content of $a0 must be saved on the
stack before the call is made.

3. Please explain the concept of non-restoring division algorithm.

Answer:
In restoring division, we use subtraction to test whether the remainder is at
least as large as the divisor; if the result of the subtraction is negative, we
must add the divisor back to restore the original value (r + d), then shift the
remainder left one bit ((r + d) × 2) and subtract the divisor again in the next
iteration ((r + d) × 2 - d). In non-restoring division, when the result of
subtracting the divisor is negative, we do not add the divisor back
immediately; instead we first shift the remainder left one bit (r × 2) and then
add the divisor in the next iteration (r × 2 + d), which is valid because
(r + d) × 2 - d = r × 2 + d. Assuming an addition and a subtraction take the
same time, the non-restoring approach saves one operation in each such
iteration, so its performance is better.
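A minimal unsigned non-restoring divider built on the identity (r + d) × 2 - d = r × 2 + d (a sketch; the quotient bit is 1 whenever the new remainder is non-negative, and one final restoring step recovers the remainder):

```python
def nonrestoring_div(dividend, divisor, n=8):
    """Unsigned non-restoring division producing n quotient bits."""
    r, q, d = dividend, 0, divisor << n
    for _ in range(n):
        # after a negative step we add instead of restoring-then-subtracting
        r = (r << 1) - d if r >= 0 else (r << 1) + d
        q = (q << 1) | (r >= 0)           # quotient bit 1 iff remainder >= 0
    if r < 0:
        r += d                            # single final restoring step
    return q, r >> n                      # (quotient, remainder)
```

Each iteration performs exactly one add or subtract, in contrast with restoring division's occasional extra restore.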
4. We wish to compare the performance of two different computers: M1 and M2.
Following measurements have been made on these computers: Program 1
executes for 2.0 seconds on M1, and 1.5 seconds on M2, whereas Program 2
executes for 5.0 seconds on M1, and 10.0 seconds on M2.
(a) Which computer is faster for each program, and how many times as fast is it?
The following additional measurements were then made: Program 1 executes
5 × 10^9 instructions on M1, and 6 × 10^9 instructions on M2.
(b) Find instruction execution rate (instructions/second) for each computer when
running Program 1.
Suppose M1 costs $500 and M2 costs $800. A user requires that Program 1 must
be executed 1600 times each hour. Any remaining time is used to run Program 2.
If the computer has enough performance to execute Program 1 the required
number of times per hour, then performance is measured by the throughput for
Program 2.
(c) Which computer is faster for this workload? Why?
(d) Which computer is more cost-effective? Show your calculations.

Answer:
(a) For Program 1, M2 is 2/1.5 = 1.33 times faster than M1
For Program 2, M1 is 10/5 = 2 times faster than M2
(b) The instruction execution rate for M1 = (5 × 10^9)/2 = 2.5 × 10^9 instr./sec.
The instruction execution rate for M2 = (6 × 10^9)/1.5 = 4 × 10^9 instr./sec.
(c) Executing program 1 1600 times on M1 takes 2(1600) = 3200 sec which
leaves 400 sec for program 2. Hence it will execute 400/5 = 80 times.
Executing program 1 1600 times on M2 takes 1.5(1600) = 2400 sec which
leaves 1200 sec for program 2. This program takes 10 sec on M2 so it will
execute 1200/10 = 120 times during the hour.
Therefore M2 is faster for this workload
(d) So far as cost effectiveness we can compare them by $500/80 = $6.25 per
iteration/hr for M1, while for M2 we have $800/120 = $6.67 per iteration/hr,
so M1 is more cost effective.

5. Given the code sequence:
lw $t1, 8($t7) ; assume mem($t7+8) contains (+72)10
addi $t2, $zero, #10
nor $t3, $t1, $t2
beq $t1, $t2, Label
add $t4, $t2, $t3
sw $t4, 108($t7)
Label: ...
According to the multi-cycle implementation scheme in the textbook (see figure
below),
(a) How many cycles will it take to execute this code?
(b) What is going on during the 19th cycle of execution?
(c) In which cycle does the actual addition of 108 and $t7 take place?
Step 1, Instruction fetch: IR ← Memory[PC]; PC ← PC + 4
Step 2, Instruction decode/register fetch: A ← Reg[IR[25-21]];
    B ← Reg[IR[20-16]]; ALUOut ← PC + (sign-extend(IR[15-0]) << 2)
Step 3, Execution, address computation, branch/jump completion:
    R-type: ALUOut ← A op B
    Memory reference: ALUOut ← A + sign-extend(IR[15-0])
    Branch: if (A == B) then PC ← ALUOut
    Jump: PC ← {PC[31-28], (IR[25-0] << 2)}
Step 4, Memory access or R-type completion:
    R-type: Reg[IR[15-11]] ← ALUOut
    Load: MDR ← Memory[ALUOut]; Store: Memory[ALUOut] ← B
Step 5, Memory read completion:
    Load: Reg[IR[20-16]] ← MDR

Answer:
(a) 5 + 4 + 4 + 3 + 4 + 4 = 24
(b) The contents of registers $t2 and $t3 are added by ALU
(c) The 23rd cycle (the address-computation step of the sw instruction)
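The cycle accounting behind (a) through (c) can be laid out explicitly (a sketch; the per-instruction cycle counts follow the multi-cycle step table above):

```python
# (name, cycles) for: lw, addi, nor, beq, add, sw in program order
seq = [('lw', 5), ('addi', 4), ('nor', 4), ('beq', 3), ('add', 4), ('sw', 4)]
timeline, cc = {}, 1
for name, steps in seq:
    for step in range(1, steps + 1):
        timeline[cc] = (name, step)       # which instruction and step run in cycle cc
        cc += 1
total_cycles = cc - 1                     # 24
```

Cycle 19 falls on step 3 of the add (the ALU adds $t2 and $t3), and cycle 23 on step 3 of the sw (the ALU adds 108 and $t7).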

6. Instruction count, CPI, and clock rate are three key factors to measure
performance. The performance of a program depends on the algorithm, the
programming language, the compiler, the instruction set architecture, and the
actual hardware used.
(a) What performance factor(s) above may be affected by using different
Instruction Set Architectures? Why?
(b) MIPS (Million Instructions per Second) of running a benchmark program on
machine A is higher than that of running the same benchmark on machine B.
Which machine is faster? Why?

Answer:
(a) All three: instruction count, CPI, and clock rate. The ISA determines how
many instructions a program compiles into, how many cycles each instruction
needs, and, through the complexity of its instructions, the achievable clock
cycle time.
(b) We can not differentiate which machine is faster from the measure of MIPS
before the capabilities of the ISA of these two machines are given.

7. To implement these five MIPS instructions: [lw, sb, addi, xor, beq],
(a) If simple single-cycle design is used, at least how many adders must be used?
What each of these adders is used for?
(b) Similarly, at least how many memories are there? What each of them is used
for?
(c) Repeat (a) for multi-cycle design.
(d) Repeat (b) also for multi-cycle design.

Answer:
(a) Two adders are needed for the single-cycle design: one for the PC + 4
calculation and the other for the branch target address calculation.
(b) Two memories are needed for the single-cycle design: one for instruction
fetch and the other for data access.
(c) No adder is needed for the multi-cycle design, because the ALU can perform
these additions in separate cycles.
(d) One memory is needed, shared by instruction fetch and data access.
8. Assume the three caches below, each consisting of 16 words. Given the series of
address references as word addresses: 2, 3, 4, 16, 18, 16, 4, 2. Please label each
reference as a hit or a miss for the three caches (a), (b), and (c) below. Assuming
that LRU is used for cache replacement algorithm and all the caches are initially
empty.
(a) a direct-mapped cache with 16 one-word blocks;
(b) a direct-mapped cache with 4 four-word blocks;
(c) a four-way set associative cache with block size of one-word.

Answer:
(a) Direct-mapped cache with 16 one-word blocks:

    Address  Binary   Tag  Index  Hit/Miss  3C
    2        00010    0    0010   Miss      compulsory
    3        00011    0    0011   Miss      compulsory
    4        00100    0    0100   Miss      compulsory
    16       10000    1    0000   Miss      compulsory
    18       10010    1    0010   Miss      compulsory
    16       10000    1    0000   Hit
    4        00100    0    0100   Hit
    2        00010    0    0010   Miss      conflict

(b) Direct-mapped cache with 4 four-word blocks:

    Address  Binary   Tag  Index  Hit/Miss  3C
    2        00010    0    00     Miss      compulsory
    3        00011    0    00     Hit
    4        00100    0    01     Miss      compulsory
    16       10000    1    00     Miss      compulsory
    18       10010    1    00     Hit
    16       10000    1    00     Hit
    4        00100    0    01     Hit
    2        00010    0    00     Miss      conflict

(c) Four-way set-associative cache with one-word blocks:

    Address  Binary   Tag  Index  Hit/Miss  3C
    2        00010    000  10     Miss      compulsory
    3        00011    000  11     Miss      compulsory
    4        00100    001  00     Miss      compulsory
    16       10000    100  00     Miss      compulsory
    18       10010    100  10     Miss      compulsory
    16       10000    100  00     Hit
    4        00100    001  00     Hit
    2        00010    000  10     Hit
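All three traces can be reproduced with a small LRU cache simulator parameterized by the number of sets, associativity, and block size in words (a sketch; names are illustrative):

```python
from collections import OrderedDict

def simulate(addresses, num_sets, ways, block_words):
    """Word-addressed cache with per-set LRU; returns 'hit'/'miss' per reference."""
    sets = [OrderedDict() for _ in range(num_sets)]
    out = []
    for addr in addresses:
        block = addr // block_words
        index, tag = block % num_sets, block // num_sets
        s = sets[index]
        if tag in s:
            s.move_to_end(tag)            # refresh LRU order
            out.append('hit')
        else:
            if len(s) == ways:
                s.popitem(last=False)     # evict the least recently used tag
            s[tag] = True
            out.append('miss')
    return out
```

The three configurations are `simulate(refs, 16, 1, 1)`, `simulate(refs, 4, 1, 4)`, and `simulate(refs, 4, 4, 1)` for the reference stream 2, 3, 4, 16, 18, 16, 4, 2.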

9. Continued from above question 8:


(a) For each of above (a), (b), and (c) caches, how many misses are compulsory
misses?
(b) For each of above (a), (b), and (c) caches, how many misses are conflict
misses?
(c) What type of cache misses (compulsory, conflict and capacity) can be reduced
by increasing the cache block size?
(d) What type of cache misses can be reduced by increasing set associativity?

Answer:
Cache configuration                                  (a) compulsory  (b) conflict
direct-mapped cache with 16 one-word blocks          5               1
direct-mapped cache with 4 four-word blocks          3               1
four-way set associative cache, one-word blocks      5               0

(c) compulsory
(d) conflict

10. What is the average CPI for each of the following 4 schemes when executing
the code sequence below? (Note: For the pipelined scheme, there are five
stages: IF, ID, EX, MEM, and WB. We assume the reads and writes of the
register file can occur in the same clock cycle, and stall circuits are
available.)
add $t3, $s1, $s2
sub $t1, $s1, $s2
lw $t2, 100($t3)
sub $s1, $t1, $t2
(a) single cycle scheme;
(b) multi-cycle scheme without pipelining;
(c) pipelined scheme without data forwarding hardware;
(d) pipelined scheme with data forwarding hardware (one from EX/MEM to ALU
input, and the other from MEM/WB to ALU input) available.

Answer:
(a) CPI = 1
(b) CPI = (4 + 4 + 5 + 4)/4 = 4.25
(c) The clocks for executing this code = (5 – 1) + 4 + 3 = 11, CPI = 11/4 = 2.75
(d) The clocks for executing this code = (5 – 1) + 4 + 1 = 9, CPI = 9/4 = 2.25

95 年交大資工

1. Booth‘s algorithm is an elegant approach to multiply signed numbers. It starts


with the observation that with the ability to both add and subtract there are
multiple ways to compute a product. The key to Booth‘s insight is in his
classifying groups of bits into the beginning, the middle, or the end of a run of 1s.
Is Booth‘s algorithm always better? Why does the Booth‘s algorithm work for
multiplication of two‘s complement signed integers?

Answer:
(1) No. When the multiplier bits alternate between 0 and 1 (for example,
01010101), Booth's algorithm performs no better than the conventional
multiplication algorithm.
(2) Let the multiplier a = a31a30...a0 and the multiplicand b = b31b30...b0
both be signed numbers. The decimal value of the multiplier a is
    a = (-a31 × 2^31) + (a30 × 2^30) + (a29 × 2^29) + ... + (a1 × 2^1) + (a0 × 2^0)
Booth's algorithm examines each bit pair (ai, ai-1), with a-1 = 0:

    ai  ai-1 | ai-1 - ai | Operation
    0   0    |     0     | Do nothing
    0   1    |    +1     | Add b
    1   0    |    -1     | Subtract b
    1   1    |     0     | Do nothing

The computation it performs is therefore

      (a-1 - a0)  × b × 2^0
    + (a0  - a1)  × b × 2^1
    + (a1  - a2)  × b × 2^2
    + ...
    + (a29 - a30) × b × 2^30
    + (a30 - a31) × b × 2^31
    = b × (-a31 × 2^31 + a30 × 2^30 + a29 × 2^29 + ... + a1 × 2^1 + a0 × 2^0)
    = b × a

This proves that Booth's algorithm performs signed multiplication.
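The telescoping-sum argument can also be verified numerically: summing the Booth terms (a_{i-1} - a_i) × b × 2^i over a 32-bit pattern reproduces the signed product (a sketch; function names are illustrative):

```python
def signed(x, bits=32):
    """Two's-complement value of a bits-wide pattern."""
    return x - (1 << bits) if (x >> (bits - 1)) & 1 else x

def booth_value(a, b, bits=32):
    """Sum of Booth terms (a_{i-1} - a_i) * b * 2^i, with a_{-1} = 0."""
    total, prev = 0, 0
    for i in range(bits):
        ai = (a >> i) & 1
        total += (prev - ai) * b * (1 << i)
        prev = ai
    return total
```

For example, the all-ones pattern 0xFFFFFFFF represents -1, and the Booth sum times 123 indeed yields -123.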

2. The general division algorithm is called restoring division, since each time the
result of subtracting the divisor from the dividend is negative you must add the
divisor back into the dividend to restore the original value. An even faster
algorithm does not immediately add the divisor back if the remainder is negative.
This non-restoring division algorithm takes 1 clock per step. Use the expression
(r + d) × 2 - d = r × 2 + d to explain the non-restoring algorithm.

Answer:
In restoring division, we use subtraction to test whether the remainder is at
least as large as the divisor; if the result of the subtraction is negative, we
must add the divisor back to restore the original value (r + d), then shift the
remainder left one bit ((r + d) × 2) and subtract the divisor again in the next
iteration ((r + d) × 2 - d). In non-restoring division, when the result of
subtracting the divisor is negative, we do not add the divisor back
immediately; instead we first shift the remainder left one bit (r × 2) and then
add the divisor in the next iteration (r × 2 + d), which is valid because
(r + d) × 2 - d = r × 2 + d. Assuming an addition and a subtraction take the
same time, the non-restoring approach saves one operation in each such
iteration, so its performance is better.

3. Multiple forms of addressing are generally called addressing modes. The MIPS
addressing modes are the following:
(1) Register addressing, where the operand is a register.
(2) Base or displacement addressing, where the operand is at the memory
location whose address is the sum of a register and a constant in the
instruction.
(3) Immediate addressing, where the operand is a constant within the instruction
itself.
(4) PC-relative addressing, where the address is the sum of the PC and a constant
in the instruction.
(5) Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC.
The following binary codes are corresponding to their MIPS instructions,
respectively. Indicate these two instructions belonging to which of the above
addressing modes and according to them, find binary codes of [add $s4, $t3, $t2]
and [lw $s0, 48($t1)].
add $t0, $s1, $s2 00000010 00110010 01000000 00100000
lw $t0, 32($s2) 10001110 01001000 00000000 00100000

Answer:
1. add $s4, $t3, $t2 belongs to Register addressing and its binary code is
00000001 01101010 10100000 00100000
2. lw $s0, 48($t1) belongs to Base or displacement addressing and its binary
code is 10001101 00110000 00000000 00110000
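Both encodings can be checked with a tiny encoder for the two formats used here (a sketch; the register-number table covers only the registers that appear in this problem):

```python
REG = {'$zero': 0, '$t0': 8, '$t1': 9, '$t2': 10, '$t3': 11,
       '$s0': 16, '$s1': 17, '$s2': 18, '$s4': 20}

def encode_add(rd, rs, rt):                     # R-format: op = 0, funct = 0x20
    return (REG[rs] << 21) | (REG[rt] << 16) | (REG[rd] << 11) | 0x20

def encode_lw(rt, offset, base):                # I-format: op = 0x23 (lw)
    return (0x23 << 26) | (REG[base] << 21) | (REG[rt] << 16) | (offset & 0xFFFF)
```

The given example instructions and the two instructions asked for all round-trip to the binary patterns shown in the problem and answer.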

4. A compiler designer is trying to decide between two code sequences for a


particular machine. The hardware designers have supplied the following facts: the
CPI (clocks per instruction) of instruction class A is 1, the CPI of instruction class
B is 2 and the CPI of instruction class C is 3. For a particular high-level-language
statement, the compiler writer is considering two code sequences that require the
following instruction counts: Sequence-1 executes 2 As, 1 B, and 2 Cs, and
Sequence-2 executes 4 As, 1 B, and 1 C.
Sequence-1 executes 2 + 1 + 2 = 5 instructions. Sequence-2 executes 4 + 1 + 1 =
6 instructions. So Sequence-1 executes fewer instructions. We also know that
CPU clock cycles1 = (2 × 1) + (1 × 2) + (2 × 3) = 10 cycles and
CPU clock cycles2 = (4 × 1) + (1 × 2) + (1 × 3) = 9 cycles.

So Sequence-2 is faster, even though it actually executes one extra instruction.
Since Sequence-2 takes fewer overall clock cycles but has more instructions, it
must have a lower CPI.
CPI1 = CPU clock cycles1 / instruction count1 = 10/5 = 2.
CPI2 = CPU clock cycles2 / instruction count2 = 9/6 = 1.5.
The above shows the danger of using only one factor (instruction count) to assess
performance. When comparing two machines, you must look at all three
components, which combine to form execution time. If some of the factors are
identical, like the clock rate in the above example, performance can be
determined by comparing all the non-identical factors.
The question is that, based on CPI1/CPI2 = 2/1.5, can we provide two versions of
the particular machine, in which (clock rate of Msequence-1) / (clock rate of
Msequence-2) = 4/3, to let the two code sequences have the same execution time?

Answer: No.
Suppose that the clock rate of Msequence-1 is 4 GHz and that of Msequence-2 is 3 GHz.
The execution time for Sequence-1 = (5 × 2) / (4 × 10^9) = 2.5 ns
The execution time for Sequence-2 = (6 × 1.5) / (3 × 10^9) = 3 ns
Even though the clock-rate ratio of the two machines is 4/3, their execution
times differ.

5. Following the above question assume (clock rate of Msequence-1) / (clock rate of
Msequence-2) = 6/5, you are asked to adjust the CPI of instruction class C to let the
two code sequences have the same execution time?

Answer:
Suppose that the adjusted CPI of class C instructions is x.
Then ((2 × 1) + (1 × 2) + (2 × x)) / 6 = ((4 × 1) + (1 × 2) + (1 × x)) / 5
Solving gives x = 4, so the CPI of instruction class C should be adjusted to 4.

6. Suppose a program runs in 100 seconds on a machine, with multiply operations


responsible for 80 seconds of this time. The execution time of the program after
making the improvement is given by the following equation:
execution time after improvement = (exec time affected by improvement / amount
of improvement) + (exec time unaffected)
If one wants the program to run two times faster, that is 50 seconds, then
execution time after improvement = (80 seconds / n) + (100 − 80 seconds)
So, 50 seconds = (80 seconds / n) + 20 seconds. Thus n, the amount of
improvement, is 8/3.
The performance enhancement possible with a given improvement is limited by
the amount that the improved feature is used. This concept is referred to as
Amdahl's law in computing.

Let's consider a general model that the subsystem-A operations responsible for a
seconds and the subsystem-B operations responsible for b seconds of the total
execution time t seconds. We also recognize that the improvement needs costs.
Assume that the subsystem-A needs cost of CA to get 10/9 improvement and it
continues needing cost of CA to get 10/9 improvement of the improved
subsystem-A, i.e., 100/81 improvement of the original-subsystem-A. Assume the
improvement is restricted as the above discrete function. Subsystem-B follows
the same rule with the discrete 10/9 improvement and discrete cost of CB.
Suppose the subsystem-A has nA times improvements and subsystem-B has nB
times improvements.
The question is that under the total cost limitation CL, you are asked to discuss
how to formulate the problem to get the maximum improvement by improving
both subsystem-A and subsystem-B.
(a) Calculate both the costs to improve subsystem-A and subsystem-B.
(b) Calculate both the responsible time of improved subsystem-A and
subsystem-B.
(c) Formulate the problem you are going to solve.

Answer:
(a) The cost to improve subsystem-A is CA × nA; the cost to improve
subsystem-B is CB × nB.
(b) The responsible time of the improved subsystem-A is a / (10/9)^nA seconds,
and that of the improved subsystem-B is b / (10/9)^nB seconds.
(c) Minimize a / (10/9)^nA + b / (10/9)^nB + (t - a - b),
subject to CA × nA + CB × nB ≤ CL.

7. A simple, single-cycle implementation of MIPS processor capable of executing


{lw, sw, add, sub, and, or, slt, beq, j} instructions is given in the Patterson and
Hennessy book.
(a) How many adders does this implementation need? And what does each of the
adders do?
(b) If we change the implementation to a multi-cycle style, then at least how
many adders do we still need? And what does each of these adders do?

Answer:

(a) Two adders are required for single-cycle implementation. One is for
computing the branch target, and the other is for computing the next
instruction address (PC + 4).
(b) None of adders is needed, because ALU could compute the next instruction
address in the first cycle and the branch target in the second cycle.

8. Comparing the following two implementations for MIPS processor: Multi-cycle


approach (all cycles dedicated to executing a single instruction), and pipelined
approach:
(a) There are two major advantages of the multi-cycle approach. What are they?
(b) For the pipelined approach, what extra hardware costs will it require? (State
only the principle, and you do not need to list the exact hardware items.)
(c) What is the most noticeable advantage of the pipelined approach?

Answer:
(a) The ability to allow instructions to take different numbers of clock cycles and
the ability to share functional units within the execution of a single
instruction.
(b) Pipeline registers, memory and adders.
(c) Pipelined approach gains efficiency by overlapping the execution of multiple
instructions, increasing hardware utilization and improving performance.

9. Given a microprogram-controlled MIPS processor, its control store contents,
and a microprogram sequencer together with two dispatch tables (in which the
"value" field indicates the microinstruction address in the control store)
below:

[Figure omitted: multi-cycle MIPS datapath with microprogrammed control]

Control store contents:

Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch

Dispatch ROM 1                         Dispatch ROM 2
Op       Opcode name  Value            Op       Opcode name  Value
000000   R-format     0110             100011   lw           0011
000010   jmp          1001             101011   sw           0101
000100   beq          1000
100011   lw           0010
101011   sw           0010

(Note that SRC1 and SRC2 in the control store are represented as ALUSrcA and
ALUSrcB in the processor diagram.)
(a) How many cycles will it need to execute the following code sequence:
    lw   $t1, 0($t3)
    addi $t3, $t3, #2
    sub  $t1, $t1, $t2
    sw   $t1, 0($t4)
    addi $t4, $t4, #4
    beq  $t3, $t5, Label
(b) What operations are undertaken in the third cycle of an R-format
instruction? Include involved latches, multiplexers and other function units
in your answer.
(c) Repeat (b) for the second cycle of a memory reference instruction.

Answer:
(a) 5 + 4 + 4 + 4 + 4 + 3 = 24 cycles
(b) The ALU performs the operation specified by the function code; its two
operands are selected by setting ALUSrcA = 1 to choose register A and
ALUSrcB = 00 to choose register B.
(c) Read the register file, and use the ALU to compute
PC + (sign-extend(IR[15-0]) << 2) by setting ALUSrcA = 0 to choose the PC and
ALUSrcB = 11 to choose SignExt[IR[imm16]] << 2.

10. In the pipelined datapath design below, there is an obvious problem:

(a) Point out the problem, and explain it.


(b) Then, indicate how the problem can be corrected.

Answer:
(a) Consider a load instruction in the MEM/WB pipeline register. The instruction
in the IF/ID pipeline register supplies the write register number, yet this
instruction occurs considerably later than the load instruction, so the wrong
register number would be used for the write.
(b) We need to preserve the destination register number of the load instruction.
The load must pass the register number from the ID/EX through EX/MEM to the
MEM/WB pipeline register for use in the WB stage.

110
11. Given a 2^S-byte cache with 2^L-byte lines in an M-bit, byte-addressable memory
system,
(a) What is the range of the index field size, in number of bits, in a memory address
while accessing the cache?
(b) Repeat (a) for the tag field size.

Answer:
The number of blocks in the cache = 2^S / 2^L = 2^(S-L)

                    Tag     Index    Offset
Direct-mapped       M - S   S - L    L
Fully associative   M - L   0        L

(a) The index field ranges from 0 bits (fully associative) to S - L bits (direct-mapped).
(b) The tag field ranges from M - S bits (direct-mapped) to M - L bits (fully associative).
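As a quick sanity check on the table above, a small Python helper (our own code, not part of the exam; the numeric sizes in the example are hypothetical) computes the three field widths for any associativity between the two extremes:

```python
import math

# Address-field widths for a 2^S-byte cache with 2^L-byte lines
# in an M-bit byte-addressable memory.
def field_sizes(M, S, L, ways):
    """Return (tag, index, offset) bit widths for the given associativity."""
    blocks = 2 ** (S - L)      # number of cache lines
    sets = blocks // ways      # lines are grouped into sets of `ways` lines
    index = int(math.log2(sets))
    offset = L                 # byte offset within a line
    tag = M - index - offset   # remaining address bits
    return tag, index, offset

# The two extremes for M = 32, S = 14, L = 6:
print(field_sizes(32, 14, 6, 1))        # direct-mapped: (18, 8, 6)
print(field_sizes(32, 14, 6, 2 ** 8))   # fully associative: (26, 0, 6)
```

Direct-mapped is the 1-way case and fully associative the (number of blocks)-way case, which reproduces the ranges given in (a) and (b).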

12. In the RAID design, seven levels of the RAIDs are introduced in a commonly
used textbook.
(a) Which level of RAID uses the least storage redundancy? How much is the
redundancy?
(b) Which level uses the most redundancy, and how much is it?
(c) What is the most noticeable drawback of RAID 4 (block-interleaved parity)?
And how does RAID 5 correct this drawback?

Answer:
(a) (1) RAID 0 (striping); (2) zero redundancy
(b) (1) RAID 1 (mirroring); (2) 100%: the number of check disks equals the number
of data disks
(c) (1) The dedicated parity disk is the bottleneck, since every write must also update it;
(2) RAID 5 spreads the parity information throughout all the disks to avoid the
single-parity-disk bottleneck.

13. In parallel processors sharing data, answer the following:
(a) In uniform memory access (UMA) designs, do all processors use the same
address space?
(b) Do all UMA processors access memory at the same speed?
(c) Draw a system diagram showing how the processors and memory (modules)
are connected.

Answer:
(a) Yes
(b) Yes
(c)
   Processor     Processor     Processor
       |             |             |
     Cache         Cache         Cache
       |             |             |
   ------------------------------------  Single bus
             |               |
           Memory           I/O

94 年交大資工

1. (a) What is response time? Who will care about the response time?
(b) What is throughput? Who will care about the throughput?
(c) Think of an example in which if we improve the response time of a computer
system, its throughput will be worsened.
(d) Now think of an example in which if we improve the throughput of a computer
system, its response time will be worsened

Answer:
(a) (1) The time between the start and completion of a task. (2) Individual
computer users.
(b) (1) The total amount of work done in a given time. (2) Data center managers.
(c) Suppose a system is running a very large number of processes and we use
round-robin CPU scheduling with a very short time slice. In this situation the
response time improves but the throughput worsens, because of the
context-switch overhead.
(d) Suppose a system is running a very large number of processes and we use
shortest-job-first CPU scheduling. In this situation the throughput improves,
but the response time of long-running processes becomes worse.

2. Given three classes of instructions: class A, B, and C, having CPIA = a, CPIB = b,


and CPIC = c, where CPI stands for cycles per instruction.
(a) If we can tune the clock rate to 120% without affecting any CPIA,B,C, what is
the performance gain G(a) = [performancenew/performanceoriginal -1] = ?
(b) Increasing clock rate to 150% and then CPIA‘ = 1.5CPIA, while CPIB‘ and CPIC‘
remain unchanged. If class A instructions account for 40% of all dynamic
instructions, what is the performance gain G(b) = ?
(c) Now let the compiler come into play. Given original clock rate, if for every
class A instruction to be eliminated, there must be x class B instructions and y
class C instructions added into the execution stream. Under what condition
would you want to eliminate class A instructions?

Answer:
(a) G(a) = 1.2 - 1 = 0.2
(b) G(b) = 1.5 / (0.4 × 1.5 + 0.6) - 1 = 1.25 - 1 = 0.25
(c) Eliminate class A instructions when x·b + y·c < a.

3. MIPS has only a few addressing modes.
(a) What are these addressing modes?
(b) What makes MIPS use these addressing modes, but not others, and not more
modes? To answer this question properly, you should formulate the principles
behind the selection of these modes.

Answer:
(a) The five MIPS addressing modes:
1. Register addressing
2. Base or displacement addressing
3. Immediate addressing
4. PC-relative addressing
5. Pseudodirect addressing
(b) The choice follows the MIPS design principles: simplicity favors regularity,
make the common case fast, and good design demands good compromises; only
the few modes that cover the common cases are included.

4. Given A = an-1an-2…a2a1a0, B = bn-1bn-2…b2b1b0, and carry-in c0:


(a) What is the equation for gi, which indicates that a carry ci+1 must be generated
regardless of ci?
(b) What is the equation for pi, which indicates that ci+1 = ci?
(c) cn = f(gx, px, c0) | x = (n-1) ~ 0 = ?

Answer:
(a) gi = aibi
(b) pi = ai + bi
(c) cn = gn-1 + pn-1·gn-2 + pn-1·pn-2·gn-3 + ... + pn-1·pn-2·...·p1·g0 + pn-1·pn-2·...·p1·p0·c0
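The generate/propagate definitions above can be checked mechanically. This small sketch (our own code, not part of the printed answer) verifies that the recurrence derived from gi and pi reproduces the carry-out of ordinary addition:

```python
# Check that c_{i+1} = g_i + p_i * c_i, with g_i = a_i b_i and p_i = a_i + b_i,
# reproduces the carry-out of ordinary n-bit addition.
def carry_out(a_bits, b_bits, c0):
    """Carry out of the most significant position; bit lists are LSB first."""
    c = c0
    for a, b in zip(a_bits, b_bits):
        g, p = a & b, a | b        # generate and propagate for this bit
        c = g | (p & c)            # c_{i+1} = g_i + p_i * c_i
    return c

n = 4
for a in range(2 ** n):
    for b in range(2 ** n):
        for c0 in (0, 1):
            a_bits = [(a >> i) & 1 for i in range(n)]
            b_bits = [(b >> i) & 1 for i in range(n)]
            assert carry_out(a_bits, b_bits, c0) == ((a + b + c0) >> n) & 1
print("carry formulas verified for all 4-bit cases")
```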

5. Given the hardware and the flow chart of a multiplication algorithm,
[Figure: sequential multiplication hardware and flowchart. Hardware: a 32-bit
Multiplicand register and a 32-bit ALU feed the left half of a 64-bit Product
register (shift right and write under Control). Flowchart: test Product0; if
Product0 = 1, add the multiplicand to the left half of the product and place the
result in the left half of the Product register; then shift the Product register
right 1 bit; repeat until 32 repetitions are done.]

(a) Modify the hardware as little as possible for Booth's multiplication algorithm.
Draw the modified hardware.
(b) Based on the modified hardware designed in (a), redraw the
flow chart for Booth's multiplication algorithm.
(c) Describe the characteristics and advantages of Booth's algorithm.

Answer:
(a) [Figure: the only change is a 1-bit register Product-1 (the "mythical bit")
appended to the right of the 64-bit Product register; the 32-bit Multiplicand
register, 32-bit ALU, and shift-right/write Control are unchanged, except that the
shift becomes an arithmetic right shift and the test examines the bit pair
Product0/Product-1.]
(b) [Flowchart: test the pair Product0/Product-1. 01: add the multiplicand to the
left half of the product and place the result in the left half of the Product
register. 10: subtract the multiplicand from the left half of the product and
place the result in the left half of the Product register. 00 or 11: do nothing.
Then shift the Product register right 1 bit (arithmetic shift); repeat until 32
repetitions are done.]

(c) The major features of Booth's algorithm are that it handles signed-number
multiplication directly and that, if shifting is faster than addition, it is
faster than the traditional sequential multiplier in the average case.

6. Given the simplified datapath of a pipelined computer and a sequence of code:

[Figure: simplified pipelined datapath showing the stages IM, Reg, DM, Reg for each instruction.]

add $3, $1, $2 (Instruction 1)


lw $4, 100($3) (Instruction 2)
sub $2, $4, $3 (Instruction 3)
sw $4, 100($1) (Instruction 4)
and $6, $3, $2 (Instruction 5)
(a) Identify all of the data dependencies in the code by the following
representation:
Ri: Ij  Ik
It means that Instruction k (Ik): depends on Instruction j (Ij) for Register i
(Ri).
(b) Which dependencies are data hazards that may be data forwarded?
(c) Which dependencies are data hazards that must be resolved via stalling?

Answer:
(a) R3: I1 → I2      (b) R3: I1 → I2      (c) R4: I2 → I3
    R3: I1 → I3          R3: I1 → I3
    R3: I1 → I5          R4: I2 → I4
    R4: I2 → I3          R2: I3 → I5
    R4: I2 → I4
    R2: I3 → I5

7. Given the datapath of a single-cycle computer and the definition and formats of
its instructions:
[Figure: single-cycle MIPS datapath: PC, instruction memory, register file,
sign-extension unit, ALU, and data memory, with the control signals PCSrc,
RegWrite, ALUSrc, MemtoReg, MemRead, MemWrite, RegDst, and ALUOp.]
add $rd, $rs, $rt #rd = $rs + $rt R-format
lw $rt, addr($rs) #$rt = Memory[$rs + sign-extended addr] I-format
beq $rs, $rt, addr #if ($rs = $rt) go to PC + 4 + 4 × addr I-format

Field size 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits


R-format op rs rt rd shamt funct
I-format op rs rt address/immediate
Complete the following table for the setting of the control lines of each instruction.
Assume that the control signal Branch and the Zero output of the ALU are ANDed
together to become the control signal PCSrc.
Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch
add
lw
beq
(1: control line asserted, 0: control line deasserted, ×: don't care)

Answer:
Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch
add 0 1 0 1 0 0 0
lw 1 0 1 1 1 0 0
beq × 1 × 0 0 0 1
Note: this problem is flawed. According to the figure and the problem statement,
the Branch column cannot be made to work whether it is filled 1, 1, 0 or 0, 0, 1
from top to bottom.

8. Suppose we have a processor with a base CPI of 1.0, assuming all references hit
in the primary cache, and a clock rate of 5 GHz. Assume a main memory access
time of 100 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 2%. How much faster will the processor be if
we add a secondary cache that has a 5 ns access time for either a hit or miss and is
large enough to reduce the miss rate to main memory to 0.5%?

Answer:
The clock cycle time is 1 / 5 GHz = 0.2 ns, so the miss penalty to main memory is
100 ns / 0.2 ns = 500 clock cycles.
For the processor with one level of caching, total CPI = 1.0 + 2% × 500 = 11.0
The miss penalty for an access to the second-level cache is 5 ns / 0.2 ns = 25
clock cycles.
For the two-level cache, total CPI = 1.0 + 2% × 25 + 0.5% × 500 = 4.0
Thus, the processor with the secondary cache is faster by 11.0 / 4.0 ≈ 2.8
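The same arithmetic in executable form (our sketch; note that the quoted 2.8 rounds the exact ratio 2.75):

```python
# Two-level cache CPI comparison: 5 GHz clock, 100 ns memory, 5 ns L2.
clock_ghz = 5                                   # 5 GHz means 5 cycles per ns
main_penalty = 100 * clock_ghz                  # 100 ns -> 500 cycles
l2_penalty = 5 * clock_ghz                      # 5 ns  -> 25 cycles

cpi_one_level = 1.0 + 0.02 * main_penalty                        # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty   # 4.0
print(cpi_one_level, cpi_two_level, cpi_one_level / cpi_two_level)
```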

9. According to the following MIPS program, complete the given table by using the
technique of loop unrolling for superscalar pipelines. Write your answer with the
leading (1), (2), ... and (9).
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
ALU or branch inst. Data transfer inst. Clock cycle
Loop: addi $s1, $s1, -16 lw $t0, 0($s1) 1
(blank) lw $t1, 12($s1) 2
(1) lw $t2, 8($s1) 3
(2) (6) 4
addu $t2, $t2, $s2 (7) 5
(3) (8) 6
(4) (9) 7
(5) sw $t3, 4($s1) 8

Answer:
(1) addu $t0, $t0, $s2
(2) addu $t1, $t1, $s2
(3) addu $t3, $t3, $s2
(4) (blank)
(5) bne $s1, $zero, Loop
(6) lw $t3, 4($s1)
(7) sw $t0, 16($s1)
(8) sw $t1, 12($s1)
(9) sw $t2, 8($s1)

10. The total number of bits needed for a cache is the sum of the data, tag, and
valid bits. Assuming 32-bit byte addresses, consider a direct-mapped cache of 2^n
blocks with 2^m-word (2^(m+2)-byte) blocks. What is the total number of bits in
such a cache?

Answer:
The size of a tag = 32 - (n + m + 2) = 30 - n - m bits
The total size of the cache = 2^n × (32 × 2^m + 30 - n - m + 1) bits
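The formula can be checked on a concrete configuration (our example: n = 10, i.e. 1024 blocks, with one-word blocks, m = 0):

```python
# Total bits for a direct-mapped cache with 2^n blocks of 2^m words each,
# given 32-bit byte addresses: data + tag + valid bit per block.
def cache_bits(n, m):
    tag = 32 - (n + m + 2)                    # 30 - n - m tag bits
    return 2 ** n * (32 * 2 ** m + tag + 1)

print(cache_bits(10, 0))   # 1024 * (32 + 20 + 1) = 54272 bits
```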

11. (a) Assume an instruction cache miss rate for a program is 2% and a data cache
miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and
the miss penalty is 100 cycles for all misses, determine how much faster a
processor would run with perfect instruction and data caches that never
missed. Assume the frequency of all loads and stores is 36%.
(b) Suppose we increase the performance of the processor by doubling its clock
rate. How much faster will the processor be with the faster clock, assuming
the same miss rates and the absolute time to handle a cache miss does not
change?

Answer:
(a) CPI considering miss penalty = 2 + 0.02 × 100 + 0.04 × 0.36 × 100 = 5.44
5.44 / 2 = 2.72 times faster
(b) Measured in faster clock cycle, the new miss penalty will be 200 cycles
Total miss cycles per instruction = 2% × 200 + 36% × (4% × 200) = 6.88
Faster system with cache miss, CPI = 2 + 6.88 = 8.88
Slower system with cache miss, CPI = 5.44
The faster-clock system will be faster by
Execution time (slow clock) / Execution time (fast clock)
= (I × CPI_slow × cycle time) / (I × CPI_fast × ½ × cycle time)
= 5.44 / (8.88 × ½) = 1.23 times

12. This is about I/O system design. Consider the following computer system: (1) a
CPU that sustains 3 billion instructions per second and averages 100,000
instructions in the operating system per I/O operation, (2) a memory backplane
bus capable of sustaining a transfer rate of 1000 MB/sec, (3) SCSI Ultra320
controllers with a transfer rate of 320 MB/sec and accommodating up to 7 disks,
and (4) disk drives with a read/write bandwidth of 75 MB/sec and an average
seek plus rotational latency of 6 ms. If the workload consists of 64 KB reads
(where the block is sequential on a track) and the user program needs 200,000
instructions per I/O operation, find the maximum sustainable I/O rate and the
number of disks and SCSI controllers required. Assume that the reads can always
be done on an idle disk if one exists (i.e., ignore disk conflicts).

Answer:
The two fixed components of the system are the memory bus and the CPU. Let's
first find the I/O rate that these two components can sustain and determine which
of these is the bottleneck. Each I/O takes 200,000 user instructions and 100,000
OS instructions,
so the maximum I/O rate of the CPU
= Instruction execution rate / Instructions per I/O
= (3 × 10^9) / ((200 + 100) × 10^3)
= 10,000 I/Os per second

119
Each I/O transfers 64 KB, so
Maximum I/O rate of bus = Bus bandwidth / Bytes per I/O
= (1000 × 10^6) / (64 × 10^3) = 15,625 I/Os per second
The CPU is the bottleneck, so we can now configure the rest of the system to
perform at the level dictated by the CPU, 10,000 I/Os per second.
Let's determine how many disks we need to be able to accommodate 10,000 I/Os
per second. To find the number of disks, we first find the time per I/O operation
at the disk:
Time per I/O at disk = Seek + rotational time + Transfer time
= 6 ms + 64 KB / (75 MB/sec) = 6.9 ms
Thus, each disk can complete 1000 ms/6.9 ms or 146 I/Os per second. To saturate
the CPU requires 10,000 I/Os per second, or 10,000/146 ≈ 69 disks.
To compute the number of SCSI buses, we need to check the average transfer rate
per disk to see if we can saturate the bus, which is given by
Transfer rate = Transfer size / Transfer time = 64 KB / 6.9 ms ≈ 9.56 MB/sec
The maximum number of disks per SCSI bus is 7, which won't saturate this bus.
This means we will need 69/7, or 10 SCSI buses and controllers.
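The whole sizing calculation can be reworked in code (our sketch; all numbers come from the problem statement):

```python
import math

# I/O system sizing: find the sustainable I/O rate, then the disk
# and SCSI controller counts needed to reach it.
cpu_rate = 3e9                        # instructions per second
instr_per_io = 200_000 + 100_000      # user + OS instructions per I/O

max_io_cpu = cpu_rate / instr_per_io          # 10,000 I/Os per second (the bottleneck)
max_io_bus = 1000e6 / 64e3                    # 15,625 I/Os per second over the bus

time_per_io = 6e-3 + (64 / 75) * 1e-3         # seek+rotation + 64 KB / 75 MB/s, ~6.9 ms
ios_per_disk = 1 / time_per_io                # ~146 I/Os per second per disk
disks = math.ceil(max_io_cpu / ios_per_disk)  # disks needed to sustain the CPU rate
controllers = math.ceil(disks / 7)            # at most 7 disks per SCSI bus
print(disks, controllers)   # 69 10
```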

93 年交大資工

1. (1) What is the disadvantage of not applying Amdahl's law?
(2) What is the meaning of CPI = 1.5?
(3) Can the CPI be smaller than 1? Why?
(4) Do we need to know the ISA while designing a good compiler? Why?

Answer:
(1) Amdahl's law tells us which parts of a system still have room for improvement
and which do not. Without applying it, we may waste cost improving parts that
contribute little to overall system performance.
(2) On average, each instruction takes 1.5 CPU clock cycles to execute.
(3) Yes. A superscalar computer can issue two or more instructions per clock
cycle, so the total number of clock cycles can be smaller than the number of
instructions, and the CPI falls below 1.
(4) Yes. To generate correct and efficient code, a good compiler must know the
ISA (its instructions, registers, and addressing modes) so that it can select
instructions, allocate registers, and schedule code effectively.

2. Please draw the formats of five MIPS addressing modes.

Answer:
1. Immediate addressing: the operand is a constant inside the instruction.
   | op | rs | rt | immediate |
2. Register addressing: the operands are registers.
   | op | rs | rt | rd | ... | funct |  → registers
3. Base addressing: the operand is in memory at (register + address).
   | op | rs | rt | address |  → memory (byte, halfword, or word)
4. PC-relative addressing: the branch target is PC + address.
   | op | rs | rt | address |  → memory word at PC + address
5. Pseudodirect addressing: the jump target is the 26-bit address concatenated
   with the upper bits of the PC.
   | op | address |  → memory word

3. Compare the number of gate delays for the critical paths of two 16-bit adders, one
using ripple carry and the other using two-level carry lookahead.

Answer:
(1) In a ripple-carry adder, each bit's carry computation takes 2 gate delays, so
the critical path of a 16-bit ripple-carry adder is 2 × 16 = 32 gate delays.
(2) In a carry-lookahead adder, generating gi and pi takes 1 gate delay. The
first-level carries, with pi and gi as inputs, take 2 gate delays, and the
second-level carries, with Pi and Gi as inputs, take another 2 gate delays
(Pi and Gi can be computed in parallel with the first-level carries). The
critical path of the two-level carry-lookahead adder is therefore
1 + 2 + 2 = 5 gate delays.

Note: this problem comes from a worked example in the 2nd edition of the
textbook; as defined in the text, the critical path is the path that produces
the carry.

4. Prove that Booth's algorithm works for multiplication of two's complement

Answer:
Suppose that a is the multiplier, b is the multiplicand, a_i is the i-th bit of a,
and a_(-1) = 0. Booth's algorithm examines each bit pair (a_i, a_(i-1)):

a_i  a_(i-1) | a_(i-1) - a_i | Operation
 0      0    |       0       | Do nothing
 0      1    |      +1       | Add b
 1      0    |      -1       | Subtract b
 1      1    |       0       | Do nothing

so it implements the following computation:

  (a_(-1) - a_0)·b·2^0
+ (a_0 - a_1)·b·2^1
+ (a_1 - a_2)·b·2^2
+ ...
+ (a_29 - a_30)·b·2^30
+ (a_30 - a_31)·b·2^31
= b·(-a_31·2^31 + a_30·2^30 + a_29·2^29 + ... + a_1·2^1 + a_0·2^0)
= b·a

since the sum telescopes and the last line is exactly the two's-complement value of a.

5. According to the following figure, what are the values of a(l), a(2), ..., d(7) in the
table?
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |  a(1)  |  a(2)  |   a(3)   |   a(4)   |  a(5)   |   a(6)   |  a(7)  |   1    |   0
lw          |  b(1)  |  b(2)  |   b(3)   |   b(4)   |  b(5)   |   b(6)   |  b(7)  |   0    |   0
sw          |  c(1)  |  c(2)  |   c(3)   |   c(4)   |  c(5)   |   c(6)   |  c(7)  |   0    |   0
beq         |  d(1)  |  d(2)  |   d(3)   |   d(4)   |  d(5)   |   d(6)   |  d(7)  |   0    |   1

[Figure: single-cycle MIPS datapath with a main control unit decoding
Instruction[31-26] into RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite,
ALUSrc, and RegWrite, plus an ALU control unit driven by Instruction[5-0].]

Answer:
Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0
R-format    |   1    |   0    |    0     |    1     |    0    |    0     |   0    |   1    |   0
lw          |   0    |   1    |    1     |    1     |    1    |    0     |   0    |   0    |   0
sw          |   ×    |   1    |    ×     |    0     |    0    |    1     |   0    |   0    |   0
beq         |   ×    |   0    |    ×     |    0     |    0    |    0     |   1    |   0    |   1

6. Why it may cause the stale data problem during DMA I/O data transfer? How to
overcome it? Give three different approaches to resolve this problem and explain
your reasons clearly.

Answer:
(1) Consider a read from disk that the DMA unit places directly into memory. If
some of the locations into which the DMA writes are in the cache, the
processor will receive the old value when it does a read. Similarly, if the
cache is write-back, the DMA may read a value directly from memory when a
newer value is in the cache, and the value has not been written back. This is
called the stale data problem.
(2)
1. One approach is to route the I/O activity through the cache. This ensures
that reads see the latest value while writes update any data in the cache.
2. A second choice is to have the OS selectively invalidate the cache for an
I/O read or force write-backs to occur for an I/O write (often called cache
flushing).

3. The third approach is to provide a hardware mechanism for selectively
flushing (or invalidating) cache entries.

7. What is called split transaction protocol used in I/O data bus design?

Answer:
A split-transaction protocol increases the effective bus bandwidth by releasing
the bus while a request is being serviced: each transaction is split into a
request phase and a separate reply phase, so other masters can use the bus during
the intervening latency instead of leaving it idle.

8. There are three ways to schedule the branch delay slot in order to reduce or
eliminate the control hazard. Give simple examples to explain their principle
clearly and briefly.

Answer:
As shown below, there are three ways to schedule the branch delay slot:
(a) from before the branch, (b) from the branch target, and (c) from the
fall-through path. (a) is the best choice; use (b) or (c) only when (a) is
impossible because of a data dependency. (b) is valuable mainly when the branch
is taken, and it must be safe to execute the moved instruction even when the
branch is not taken. (c) is valuable mainly when the branch is not taken, and it
must be safe to execute the moved instruction even when the branch is taken.

9. It is well known that multi-level cache design is one of the most important ways
to upgrade the CPU performance. Suppose that we have a processor with a base
CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of
1000 MHz. Assume a main memory access time of 200 ns, including all miss
handing. Suppose the miss rate per instruction at the primary cache is 5%. How
much faster will the machine be if we add a secondary cache that has a 20 ns
access time for either a hit or a miss and is large enough to reduce the miss rate to
main memory to 2%? What are the global miss rate as well as local miss rate for
this two level cache machine?

Answer:
(1) The CPU clock cycle time = 1 / 1000 MHz = 1 ns
CPI for the one-level cache = 1 + 200 × 0.05 = 11
CPI for the two-level cache = 1 + 20 × 0.05 + 200 × 0.02 = 6
The machine with the two-level cache will be faster than the machine with the
one-level cache by 11/6 ≈ 1.83
(2)
Miss rate Primary cache Secondary cache
Global 5% 2%
Local 5% 0.02/0.05 = 40%
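The same numbers can be reproduced in a few lines of code (our sketch; the 1 GHz clock gives a 1 ns cycle, so the penalties in nanoseconds equal the penalties in cycles):

```python
# Two-level cache CPI and the global/local miss rates of question 9.
main_penalty = 200          # 200 ns / 1 ns per cycle
l2_penalty = 20             # 20 ns / 1 ns per cycle

cpi_l1_only = 1 + 0.05 * main_penalty                          # 11
cpi_two_level = 1 + 0.05 * l2_penalty + 0.02 * main_penalty    # 6
local_l2_miss = 0.02 / 0.05                                    # 40% of L2 accesses miss
print(cpi_l1_only, cpi_two_level, round(cpi_l1_only / cpi_two_level, 2))
```

The global L2 miss rate (2%) is per instruction, while the local rate (40%) is per access that actually reaches the secondary cache; dividing one by the primary miss rate gives the other.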

92 年交大資工

1. A base processor and two options for improving its hardware and compiler design
are described as follows :
(a) The base machine, Mbase:
Mbase has a clock rate of 200 MHz and the following measures:
Instruction class CPI Frequency
A 2 50%
B 3 20%
C 4 30%
(b) The machine with improved hardware, Mhw:
Mhw has a clock rate of 250 MHz and the following measures :
Instruction class CPI Frequency
A 1 50%
B 2 20%
C 3 30%
(c) The combination of the improved compiler and the base machine, Mcomp:
The instruction improvements from this enhanced compiler are: as follows :
Instruction class % of instructions executed v.s. Mbase
A 70%
B 80%
C 60%

(1) What is the CPI (clock cycles per instruction) for each machine?
(2) How much faster is each of Mhw and Mcomp than Mbase?

Answer:
(1) CPIMbase = 0.5 × 2 + 0.2 × 3 + 0.3 × 4 = 2.8
CPIMhw = 0.5 × 1 + 0.2 × 2 + 0.3 × 3 = 1.8
Suppose there are I instructions to be run in Mbase then there are
0.5 × 0.7 × I + 0.2 × 0.8 × I + 0.3 × 0.6 × I = 0.69 I instruction to be run in
Mcomp
So, CPI_Mcomp = (0.5×0.7×2 + 0.2×0.8×3 + 0.3×0.6×4) I / 0.69 I ≈ 2.75
(2) ExeTime(Mbase) / ExeTime(Mhw)
= (I × 2.8 / (200 × 10^6)) / (I × 1.8 / (250 × 10^6)) ≈ 1.94
ExeTime(Mbase) / ExeTime(Mcomp)
= (I × 2.8 / (200 × 10^6)) / (0.69 I × 2.75 / (200 × 10^6)) ≈ 1.48

2. (a) Describe the basic concepts and advantages of Booth's algorithm.
(b) Explain the difference between the restoring division algorithm and the
non-restoring division algorithm.
(c) Calculate the largest and smallest positive normalized numbers for the IEEE
754 standard single-precision floating-point operand format.

Answer:
(a) Booth's algorithm replaces a string of 1s in the multiplier with an initial
subtract when it first sees a 1 and a later add for the bit after the last 1.
If the machine can shift faster than it can add, then on average Booth's
algorithm speeds up the computation. Besides, Booth's algorithm handles
signed numbers well.
(b) Unlike the restoring division algorithm, which immediately adds the divisor
back to the remainder register whenever the subtraction goes negative, the
non-restoring division algorithm defers the correction and instead adds the
divisor to the remainder register in the next iteration. This reduces the
number of additions and subtractions.
(c) Largest number: 1.1111 1111 1111 1111 1111 111 × 2^127
Smallest normalized number: 1.0 × 2^(-126)
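The two extremes in (c) can be verified by assembling the bit fields directly (our own check, using the standard IEEE 754 single-precision layout):

```python
import struct

# Build a single-precision float from its sign, 8-bit biased exponent,
# and 23-bit fraction fields, then read it back as a Python float.
def f32(sign, exponent, fraction):
    bits = (sign << 31) | (exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

largest = f32(0, 254, (1 << 23) - 1)   # 1.111...1 (23 ones) x 2^127
smallest = f32(0, 1, 0)                # 1.0 x 2^-126
print(largest, smallest)
```

The largest value equals (2 - 2^-23) × 2^127 ≈ 3.40 × 10^38 and the smallest normalized value equals 2^-126 ≈ 1.18 × 10^-38.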

3. Describe the following different implementations of a computer and compare


their advantages and disadvantages: single-cycle, multi-cycle, and pipelined
implementations.

Answer:
Single-cycle: an implementation in which an instruction is executed in one clock
cycle. Advantage: simple. Disadvantage: poor performance.
Multi-cycle: an implementation in which an instruction is executed in multiple
clock cycles. Advantage: less hardware overhead. Disadvantage: control is
complex.
Pipelined: an implementation in which multiple instructions are overlapped in
execution, much like an assembly line. Advantage: high performance.
Disadvantage: need to resolve hazards.

4. It is well known that control hazard is one of the main bottlenecks during
pipelining execution of instructions. You are required to describe the principles of
the following two resolving methods by using simple examples clearly and
briefly.
(a) What is called delayed branch technique? How to utilize the delay slot by
inserting appropriate instruction instead of no-op?
(b) How to use 2-bit branch prediction technique to reduce the branch hazard
penalty?

Answer:
(a) The compiler detects branch instructions and rearranges the instruction
sequence to eliminate the branch hazard penalty: an instruction that is not
affected by the branch (a safe instruction) is placed in the branch delay slot.
For example, a safe instruction such as "add $s1, $s2, $s3" appearing before
the branch can be moved into the delay slot.
(b) A 2-bit branch predictor changes its prediction only after guessing wrong
twice in a row, which raises the prediction accuracy and thus reduces the
branch hazard penalty.

5. What is called n-way set associative address mapping technique used in cache
memory design? Give an example to explain its principle clearly. Usually, it has
better hit ratio than that of direct mapping technique. Is it true? Why?

Answer:
In a set-associative cache there are a fixed number of locations (at least two)
where each block can be placed; a set-associative cache with n locations for a
block is called an n-way set-associative cache. For small caches it is true that
set-associative mapping usually has a better hit ratio than direct mapping,
because it reduces the miss rate due to conflict misses.

6. Compare the main differences among the following three I/O data transfer
techniques: polling, interrupt, and DMA. Also describe their main advantages and
disadvantages clearly and briefly

Answer:

Polling: the processor periodically checks the status of an I/O device to
determine whether it needs service. Advantage: simple. Disadvantage: wastes a
lot of processor time.
Interrupt-driven I/O: I/O devices raise interrupts to indicate to the processor
that they need attention. Advantage: eliminates the need for the processor to
poll the device and allows the processor to focus on executing programs.
Disadvantage: more complex than polling.
DMA: a device controller is given the ability to transfer data directly to or
from memory without involving the processor. Advantage: can be used to interface
a hard disk without consuming all the processor cycles. Disadvantage: requires
hardware support.

7. Given the datapath for a multi-cycle computer and the definition and formats of
its instructions,
add $rd, $rs, $rt #$rd = $rs + $rt R-format
lw $rt, addr($rs) #$rt = Memory[$rs + addr] I-format
sw $rt, addr($rs) #Memory[$rs + addr] = $rt I-format
beq $rs, $rt, addr #if ($rs = $rt) goto PC + 4 + 4 × addr I-format
j addr #go to 4 × addr J-format

Name Fields Comments


Field size 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits all MIPS instructions 32 bits
R-format op rs rt rd shamt funct arithmetic instruction format
I-format op rs rt address/immediate transfer, branch, imm format
J-format op target address jump instruction format

(a) Write the steps taken to execute the add instruction. How many clock cycles
are required for this instruction?
(b) Repeat (a) for the lw instruction.
(c) Repeat (b) for the beq instruction.

Answer:
(a) instruction fetch,instruction decode and register fetch, execution,
write back. 共4個clock cycles.
(b) instruction fetch,instruction decode and register fetch, address
calculation, memory access, write back. 共5個clock cycles.
(c) instruction fetch,instruction decode and register fetch, (include target
address calculation) branch completion. 共3個clock cycles.

96 年交大電子所

1. PC-relative addressing
(a) What is the PC-relative addressing?
(b) What is the major advantage for this addressing mode?
(c) What is the major limitation of this addressing mode implemented in a RISC
with the fixed-length instruction format?
(d) Assume instructions are always word-aligned and the immediate field is
12-bit long. What is the target range that a PC-relative branch instruction can
go to? (a word = 4 bytes)

Answer:
(a) PC-relative addressing: where the address is the sum of the PC and a constant
in the instruction.
(b) PC-relative addressing is useful in connection with conditional jumps,
because we usually only want to jump to some nearby instruction. Another
advantage of program-relative addressing is that the code may be
position-independent, i.e. it can be loaded anywhere in memory without the
need to adjust any addresses.
(c) The range that a PC-relative branch instruction can go to is limited.
(d) About ±2^11 words from the current instruction.

2. Pipelining
In general, the speedup of a 5-stage pipelined scalar processor against its
non-pipelined counterpart hardly achieves 5. Please give at least 4 reasons.

Answer:
(1) The stages may be imperfectly balanced.
(2) The delay due to pipeline register.
(3) Data hazard and control hazard.
(4) Time to "fill" the pipeline and time to "drain" it reduce the speedup.

3. Virtual memory
(a) Why can the virtual memory mechanism provide the memory protection
among processes in a multi-processing environment?
(b) Describe how a virtual address is translated into a physical address.
(c) What is the translation lookaside buffer (TLB) designed for? How does it
work?
(d) What are the benefits if a larger page size is chosen? (List at least 3.) What is
the drawback? (List at least 1.)

Answer:
(a) The hardware provides three basic capabilities to protect among processes.

1. Support two modes that indicate whether the running process is a user
process or an operating system process called supervisor process.
2. Provide a portion of the processor state that a user process can read but
not write.
3. Provide mechanisms whereby the processor can go from user mode to
supervisor mode, and vice versa.
(b) In virtual memory systems, the virtual address is split into a virtual page
number and a page offset. The virtual page number indexes the page table,
which holds the virtual-to-physical translations, to obtain the physical
page number; concatenating the physical page number with the page offset
yields the physical address.
(c) TLB is used to make address translation fast. TLB (a cache of a page table)
keep track of recently used address mappings to avoid an access to the page
table.
(d) Benefits: (1) more efficient to amortize the high access time
(2) more spatial and temporal localities
(3) smaller page table size
Drawbacks: (1) more internal fragmentation
(2) higher miss penalty

4. Amdahl's law
Amdahl's law is useful to predict the expected speedup for certain technique.
Now apply it to the power reduction. Assume the power of a circuit is
proportional to V2F, where V is the supply voltage and F is the working clock
frequency. Assume the throughput is directly proportional to F.
(a) Technique A can reduce V by a factor of 2, but it affects only 40% of the
circuit. Please derive the power reduction factor.
(b) Following (a), further assume if reducing V by a factor of 2 will result in the
maximum frequency F reduced by 80% for that part and 20% for other part.
Please derive the improvement of throughput-power product.

Answer:
(a) Power reduction factor = V^2·F / (0.4 × (V/2)^2·F + 0.6 × V^2·F) = 1/0.7 ≈ 1.43
(b) Power_new = 0.4 × (V/2)^2 × F × (1 - 0.8) + 0.6 × V^2 × F × (1 - 0.2) = 0.5 V^2F
Throughput-power product (new) = (0.4 × 0.2F + 0.6 × 0.8F) × 0.5 V^2F = 0.28 V^2F^2
Throughput-power product improvement = (F × V^2F) / (0.28 V^2F^2) ≈ 3.57

Note for (b): throughput is proportional to the frequency F, so the
throughput-power product is obtained simply by multiplying F by the power.

95 年交大電子所

1. Performance enhancement:
(1) You have two possible improvements on a computer: either make multiply
instructions run four times faster than before, or make memory access
instructions run two times faster than before. You repeatedly run a program
that takes 10000 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
(a) What will the speedup be if you improve only memory access?
(b) What will the speedup be if both improvements are made?
(2) If the gate delay of AND/OR/XOR are the same, 2ns, (regardless of the
number of inputs), show the design of the 16-bit two level carry look ahead
adder (first level is 4-bit in a group) and the speedup over the 16-bit ripple
adder.
(3) In the design of an instruction set, one principle is "make the common case
fast". Take one example from the MIPS instruction set design to illustrate this
principle.

Answer:
(1)
(a) Speedup = 1 / (0.5/2 + 0.5) ≈ 1.33
(b) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) ≈ 1.67
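The two speedups in (1) follow Amdahl's law, which a small helper (our own code) makes explicit:

```python
# Overall speedup from a list of (time fraction, improvement factor) pairs;
# any remaining fraction of the time is left unimproved.
def amdahl(parts):
    improved = sum(f / s for f, s in parts)
    rest = 1 - sum(f for f, _ in parts)
    return 1 / (improved + rest)

print(round(amdahl([(0.5, 2)]), 2))            # (a) memory access only: 1.33
print(round(amdahl([(0.2, 4), (0.5, 2)]), 2))  # (b) both improvements: 1.67
```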
(2)
Propagate (for each 4-bit adder group):
P0 = p3·p2·p1·p0
P1 = p7·p6·p5·p4
P2 = p11·p10·p9·p8
P3 = p15·p14·p13·p12
Generate (for each 4-bit adder group):
G0 = g3 + (p3·g2) + (p3·p2·g1) + (p3·p2·p1·g0)
G1 = g7 + (p7·g6) + (p7·p6·g5) + (p7·p6·p5·g4)
G2 = g11 + (p11·g10) + (p11·p10·g9) + (p11·p10·p9·g8)
G3 = g15 + (p15·g14) + (p15·p14·g13) + (p15·p14·p13·g12)
Carries between the 4-bit adder groups (Ci):
C1 = G0 + c0·P0
C2 = G1 + G0·P1 + c0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + c0·P0·P1·P2
C4 = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 + c0·P0·P1·P2·P3

[Figure: a 16-bit two-level carry-lookahead adder built from four 4-bit ALU
groups (inputs a0..a15 and b0..b15, outputs Result0..3 through Result12..15).
Each group produces its Pi and Gi signals; the carry-lookahead unit combines
P0..P3, G0..G3, and c0 to produce the group carries C1..C4.]
In a ripple-carry adder, the carry computation at each bit takes 2 gate delays, so
the critical-path delay of a 16-bit ripple-carry adder is 16 × 2 × 2 ns = 64 ns. In
the carry-lookahead adder, producing gi and pi takes 1 gate delay; the first-level
carries, computed from pi and gi, take 2 gate delays, and the second-level carries,
computed from Pi and Gi, take another 2 gate delays (Pi and Gi can be computed
in parallel with the first-level carries). The critical-path delay of the
carry-lookahead adder is therefore (1 + 2 + 2) × 2 ns = 10 ns. Speedup =
64/10 = 6.4
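The group P/G equations above can also be checked numerically. The sketch below (an illustrative Python model; the function name and structure are ours) computes the four group propagates/generates and the second-level carries, then compares the resulting sum against an ordinary addition:

```python
def cla16(a, b, c0=0):
    # Bit-level propagate (pi = ai OR bi) and generate (gi = ai AND bi).
    p = [(a >> i | b >> i) & 1 for i in range(16)]
    g = [(a >> i & b >> i) & 1 for i in range(16)]
    # Group propagate Pj and generate Gj for the four 4-bit groups,
    # exactly as in the equations above.
    P = [p[4*j] & p[4*j+1] & p[4*j+2] & p[4*j+3] for j in range(4)]
    G = [g[4*j+3] | p[4*j+3] & g[4*j+2]
         | p[4*j+3] & p[4*j+2] & g[4*j+1]
         | p[4*j+3] & p[4*j+2] & p[4*j+1] & g[4*j] for j in range(4)]
    # Second-level carries C1..C4 (recursive form, algebraically equal to
    # the flattened sum-of-products equations above).
    C = [c0]
    for j in range(4):
        C.append(G[j] | (P[j] & C[j]))
    # Form the sum bits, rippling only inside each 4-bit group.
    s = 0
    for j in range(4):
        carry = C[j]
        for i in range(4*j, 4*j + 4):
            s |= (((a >> i) ^ (b >> i) ^ carry) & 1) << i
            carry = g[i] | (p[i] & carry)
    return s, C[4]

assert cla16(0xABCD, 0x1234) == ((0xABCD + 0x1234) & 0xFFFF, 0)
assert cla16(0xFFFF, 0x0001) == (0x0000, 1)
```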
(3) 例如,使用PC-相對定址法於分支跳躍及使用立即定址法以得到常數
運算元。

2. Pipelining:
(1) "Execution time = Instruction Count * Cycles Per Instruction * Cycle Time" is
a popular performance equation. Explain why the pipelining technique can
increase the computer performance in terms of this equation.
(2) In reality, it is impossible to get an n-fold speedup by an n-stage pipelining.
Give the reasons.

Answer:
(1) (a) Pipeline can reduce the average CPI by overlapping the execution of
instructions
(b) By dividing the datapath into stages, pipeline can increase the clock rate
and thus shorten the cycle time.
(2) (a) The stages may be imperfectly balanced.
(b) The delay due to pipeline register.
(c) Data hazard and control hazard.
(d) Time to "fill" the pipeline and time to "drain" it reduce the speedup.

3. Pipeline hazards:
(1) Identify types of data hazards and explain them briefly.
(2) What techniques can be used to reduce the performance penalty caused by
control hazards? List at least 3 techniques.

Answer:
(1) Read after write (RAW): the first instruction may not have finished writing the
operand, so the second instruction may read incorrect (stale) data.
Write after read (WAR): the write may finish before the read, so the reading
instruction may incorrectly get the newly written value.
Write after write (WAW): two instructions write to the same operand; the one
issued first may finish second, leaving the operand with an incorrect value.
(2) - Branch prediction
- Move the branch decision earlier in the pipeline
- Delayed branch

4. Memory system:
(1) What is the main objective of the memory hierarchy?
(2) What is the fundamental principle that makes the memory hierarchy work?
Describe the principle briefly.
(3) Briefly describe the three common strategies for each block placement.
(4) Compare the strategies mentioned in (3) in terms of the cache miss rate and
the hardware implementation cost.

Answer:
(1) To give users as much memory as can be built with the cheapest technology,
while providing access at the speed offered by the fastest memory.
(2) The principle of locality: at any moment a program accesses only a relatively
small portion of its address space. There are two types of locality:
Temporal locality: if an item is referenced, it will tend to be referenced again
soon.
Spatial locality: if an item is referenced, items whose addresses are close by will
tend to be referenced soon.
(3) Direct-mapped cache: A cache structure in which each memory location is
mapped to exactly one location in the cache.
Set-associative cache: A cache that has a fixed number of locations (at least
two) where each block can be placed.
Fully associative cache: A cache structure in which a block can be placed in
any location in the cache.
(4)
Strategy            Miss rate   Hardware cost
Direct-mapped       High        Low
Set-associative     Medium      Medium
Fully associative   Low         High

94 年交大電子所

1. A 1-bit full adder cell is represented as the following symbol.


[Figure: 1-bit full adder cell symbol with inputs a, b, and carry-in Ci, and outputs sum S and carry-out CO.]
Assume the time delay is T for all input to output paths. Design an adder for R =
A + B + C + D by ONLY using these full adder cells. A, B, C, and D are all 4-bit
2's complement values. A can also be represented as {A3, A2, A1, A0} where A3 is
the MSB and A0 is the LSB. The same rule applies to B, C and D.
(1) What is the minimum bit width for R to be able to store all possible results?
(2) Draw your design in terms of the given symbol. You should minimize the
number of required adder cells. Report the number of adder cells used in your
design as well.
(3) What is the worst-case time delay of your adder design in terms of T? You
should minimize this time delay.

Answer:
(1) 6 bits. (The addition of two n-bit numbers yields an (n + 1)-bit result: 4 + 4
bits gives 5 bits, and 5 + 5 bits gives 6 bits.)
(2) 12 adder cells
[Figure: the 4-operand adder built from 12 full-adder cells (inputs a0-a3, b0-b3, c0-c3, d0-d3).]

(3) 6T

2. Three enhancements with the following speedups are proposed for a new
architecture: speedup1 = 30, speedup2 = 20, speedup3 = 15. Only one
enhancement is usable at a time.
(1) Please derive the Amdahl's law for multiple enhancements but each is usable
at a time, that is, list the speedup formula, assume FEi is the fraction of time
that enhancement i can be used and SEi, is the speedup of enhancement i. For
a single enhancement the equation reduces to the familiar form of Amdahl's
Law.
(2) If enhancements 1 and 2 are each usable for 25% of the time. What fraction of
the time must enhancement 3 be used to achieve an overall speedup of 10?
(3) Assume the enhancements can be used 25%, 25% and 10% of the time for
enhancements 1, 2, 3, respectively. For what fraction of the reduced execution
time is no enhancement in use?
(4) Assume, for some benchmark, the possible fraction of use is 15% for each of
enhancements 1 and 2 and 70% for enhancement 3. We want to maximize
performance. If only one enhancement can be implemented, which should it
be? If two enhancements can be implemented, which should be chosen.

Answer:
(1) Speedup = 1 / ( Σi (FEi / SEi) + (1 - Σi FEi) ), where the sums run over all
enhancements i. With a single enhancement this reduces to the familiar form of
Amdahl's law.
(2) 10 = 1 / ( 0.25/30 + 0.25/20 + f/15 + (1 - 0.25 - 0.25 - f) )  →  f ≈ 45%
(3) Suppose t is the execution time before the improvement.
Execution time after improvement = t × (0.25/30 + 0.25/20 + 0.1/15 + 0.4) ≈ 0.4275 t
No enhancement is in use for 0.4 t of that time, so the fraction of the reduced
execution time with no enhancement in use is 0.4 / 0.4275 ≈ 0.94.
(4) Speedup1 = 1 / (0.15/30 + (1 - 0.15)) ≈ 1.17
Speedup2 = 1 / (0.15/20 + (1 - 0.15)) ≈ 1.17
Speedup3 = 1 / (0.7/15 + (1 - 0.7)) ≈ 2.88
If only one enhancement can be implemented, enhancement 3 should be
chosen.
Speedup12 = 1 / (0.15/30 + 0.15/20 + (1 - 0.3)) ≈ 1.40
Speedup13 = 1 / (0.15/30 + 0.7/15 + (1 - 0.85)) ≈ 4.96
Speedup23 = 1 / (0.15/20 + 0.7/15 + (1 - 0.85)) ≈ 4.90
If two enhancements can be implemented, enhancements 1 and 3 should be
chosen.
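The generalized formula in (1) is easy to check in software. A small sketch (illustrative; the helper name is ours), evaluated against the numbers in part (4):

```python
def speedup(enhancements):
    """Amdahl's law for multiple mutually exclusive enhancements:
    enhancements is a list of (FE_i, SE_i) pairs."""
    enhanced = sum(fe / se for fe, se in enhancements)
    unenhanced = 1 - sum(fe for fe, _ in enhancements)
    return 1 / (enhanced + unenhanced)

assert round(speedup([(0.70, 15)]), 2) == 2.88               # enhancement 3 alone
assert round(speedup([(0.15, 30), (0.70, 15)]), 2) == 4.96   # 1 and 3 together
# part (2): f = 45% gives an overall speedup of roughly 10
assert abs(speedup([(0.25, 30), (0.25, 20), (0.45, 15)]) - 10) < 0.1
```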

3. Consider a virtual memory system with the following characteristics:


a. Total of 1 million pages;
b. 4K bytes of space in each page;
c. Each entry within the page table has 12 bits, including 1 valid bit.
(1) What is the total addressable physical memory space using the virtual
memory system?
(2) How many bits are required for the virtual address, including bits for (i) the
virtual page number and (ii) the page offset?
(3) Explain the meaning of page fault in a virtual memory system. What will
happen to the valid bit within the page table if page fault occurs?

Answer:
(1) The size of a physical page number = 12 - 1 = 11 bits, so the number of pages
in physical memory = 2^11 = 2K.
The total addressable physical memory space = 2K × 4 KB = 8 MB.
(2) (i) Since the virtual memory has 1 million = 2^20 pages, the size of a virtual
page number = 20 bits.
(ii) Since there are 4K = 2^12 bytes in each page, the size of the page offset = 12
bits.
The virtual address therefore requires 20 + 12 = 32 bits.
(3) Page fault: an event that occurs when an accessed page is not present in main
memory.
If the valid bit for a virtual page is off, a page fault occurs.
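The sizing arithmetic in (1) and (2) can be spelled out as follows (an illustrative sketch):

```python
import math

pte_bits, valid_bits = 12, 1
ppn_bits = pte_bits - valid_bits        # 11-bit physical page number
page_size = 4 * 1024                    # 4 KB pages
physical_bytes = (2 ** ppn_bits) * page_size
assert physical_bytes == 8 * 2 ** 20    # 8 MB of addressable physical memory

virtual_pages = 2 ** 20                 # "1 million" pages
vpn_bits = int(math.log2(virtual_pages))
offset_bits = int(math.log2(page_size))
assert (vpn_bits, offset_bits) == (20, 12)   # 32-bit virtual address in total
```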

4. The following figure shows the pipelined datapath with the control signals
identified. A sequence of MIPS instructions is given as follows:
add $s1, $s2, $s3 # Register $s1 = $s2 + $s3
sw $s1, 100 ($s4) # Store register $s1 into Memory [$s4 + 100]
(1) Because the setting of the control lines depends only on the opcode values
(Instruction [31-26]), we define whether control signal should be 0 (not
activated), 1 (activated), or X (don't care), for each of the instructions.
Complete the table by specifying the values of A, B, C, D and E.
                EX stage                     MEM stage                 WB stage
Instruction     RegDst ALUOp1 ALUOp0 ALUSrc  Branch MemRead MemWrite  RegWrite MemtoReg
add             1      1      0      0       0      A       0         1        B
sw              C      0      0      1       0      0       1         D        E
(2) The inputs carrying the register number to the register file are all 5 bits wide in the
machine language instructions. Identify the bit ranges of Read register 1 and
Read register 2 separately.
(3) The above two instructions are dependent; that is, the sw instructions uses the
results calculated by the add instruction. To resolve this hazard, we must first
detect such a hazard and then forward the proper value. Specify all the
necessary inputs to the forwarding unit (not shown in the figure) so that any
data dependence can be detected. One input is done for you,
ID/EX.Instruction [20-16].

Answer:
(1)
A = 0, B = 0, C = X, D = 0, E = X
(For add, MemRead (A) = 0 and MemtoReg (B) = 0, since the value written back
comes from the ALU, not from memory; for sw, RegDst (C) and MemtoReg (E)
are don't-cares and RegWrite (D) = 0.)

(2)
Read register 1 bit range is Instruction [25-21]
Read register 2 bit range is Instruction [20-16]
(3)
ID/EX.Instruction[25-21] (i.e., ID/EX.RegisterRs)
ID/EX.Instruction[20-16] (i.e., ID/EX.RegisterRt)
EX/MEM.RegWrite
EX/MEM.RegisterRd
MEM/WB.RegWrite
MEM/WB.RegisterRd
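With those inputs, the forwarding unit evaluates the standard comparisons. A sketch of the condition for the first ALU operand (illustrative Python, using the pipeline-register signal names above):

```python
def forward_a(idex_rs, exmem_regwrite, exmem_rd, memwb_regwrite, memwb_rd):
    """Return the ForwardA mux select: 0b10 = take the EX/MEM ALU result,
    0b01 = take the MEM/WB write-back value, 0b00 = take the register file."""
    if exmem_regwrite and exmem_rd != 0 and exmem_rd == idex_rs:
        return 0b10        # most recent result has priority
    if memwb_regwrite and memwb_rd != 0 and memwb_rd == idex_rs:
        return 0b01
    return 0b00

# add $s1,... immediately followed by an instruction reading $s1 (register 17):
assert forward_a(17, True, 17, False, 0) == 0b10
# EX/MEM takes priority over MEM/WB when both match:
assert forward_a(17, True, 17, True, 17) == 0b10
assert forward_a(17, False, 0, True, 17) == 0b01
```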

5. An I/O device tries to fetch data from memory using the following asynchronous
handshaking protocol: (grey signals are asserted by the I/O device, where
memory asserts the signals in solid black; numbered arrows are referred to the
following steps)
[Figure: timing diagram of the handshake: waveforms for ReadReq, Data, Ack, and DataRdy, with numbered arrows marking steps 1-7 listed below.]
The steps in the asynchronous protocol begin immediately after the I/O device
signals a request by raising ReadReq and putting the address on the Data lines:
(1) When memory sees the ReadReq line, it reads the address from the data bus
and raises Ack to indicate it has been seen.
(2) I/O device sees the Ack line high and releases the ReadReq and data lines.
(3) Memory sees the ReadReq is low and drops the Ack line to acknowledge the
ReadReq signal.
(4) This step starts when the memory has the data ready. It places the data from
the read request on the data lines and raises DataRdy.
(5) The I/O device sees DataRdy, reads the data from the bus, and signals that it
has the data by raising Ack.
(6) The memory sees the Ack signal, drops DataRdy, and releases the data lines.
(7) Finally, the I/O device, seeing DataRdy goes low, drops the Ack line, which
indicates that the transmission is completed.
A new bus transaction can now begin.
Your task is to implement the above asynchronous handshaking protocol in finite
state machine. Note that you only have to show the signal flow without hardware
design.

Answer:

[Figure: the handshake expressed as two communicating finite state machines, one for the I/O device and one for the memory, with transitions labeled by the protocol steps 1-7.
I/O device: on a new I/O request, put the address on the data lines and assert ReadReq; when Ack is seen, release the data lines and deassert ReadReq; when DataRdy is seen, read the data from the data lines and assert Ack; when DataRdy goes low, deassert Ack and return to idle.
Memory: when ReadReq is seen, record the address from the data lines and assert Ack; when ReadReq goes low, drop Ack, and once the data is ready, put the memory data on the data lines and assert DataRdy; when Ack is seen, release the data lines and deassert DataRdy, then await a new I/O request.]
96 年成大資工

1. Consider a MIPS processor with an additional floating point unit. Assume


functional unit delays in the processor are as follows: memory (2 ns), ALU and
adders (2 ns), FPU add (8 ns), FPU multiply (16 ns), register file access (1 ns),
and the remaining units (0 ns). Also assume instruction mix as follows: loads
(31%), stores (21%), R-format instructions (27%), branches (5%). jumps (2%),
FP adds and subtracts (7%), and FP multiplys and divides (7%).
(1) What is the delay in nanosecond to execute a load, store, R-format, branch,
jump, FP add/subtract, and FP multiply/divide instruction in a single-cycle
MIPS design?
(2) What is the averaged delay in nanosecond to execute a load, store, R-format,
branch, jump, FP add/subtract, and FP multiply/divide instruction in a
multicycle MIPS design?

Answer:
(1) 20 ns, set by the slowest instruction (FP multiply/divide):

Instruction   Memory  Register  ALU / FPU  Memory  Register  Delay (ns)
load          2       1         2          2       1         8
store         2       1         2          2       0         7
R-format      2       1         2          0       1         6
branch        2       1         2          0       0         5
jump          2       0         0          0       0         2
FP add/sub    2       1         8          0       1         12
FP mul/div    2       1         16         0       1         20
(2) Average delay = (5 × 0.31 + 4 × 0.21 + 4 × 0.27 + 3 × 0.05 + 3 × 0.02 +
4 × 0.07 + 4 × 0.07) × 16 ns = 4.24 × 16 ns = 67.84 ns

2. Consider a cache with 4 memory blocks. Assume that the cache contains no
memory block initially. How many cache misses will be introduced by the
direct-mapped, 2-way set-associative, and fully associative caches if the
memory blocks with addresses 0, 8, 0, 6, and 8 are fetched sequentially?

Answer:
Block      Direct-mapped       2-way set-associative   Fully associative
address    Tag  Index  H/M     Tag  Index  H/M         Tag  H/M
0          0    0      Miss    0    0      Miss        0    Miss
8          2    0      Miss    4    0      Miss        8    Miss
0          0    0      Miss    0    0      Hit         0    Hit
6          1    2      Miss    3    0      Miss        6    Miss
8          2    0      Miss    4    0      Miss        8    Hit
Misses     5                   4                       3
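The three miss counts can be reproduced with a small LRU cache simulator (an illustrative sketch; associativity 1 gives direct-mapped, associativity equal to the number of blocks gives fully associative):

```python
from collections import OrderedDict

def misses(addresses, num_blocks, assoc):
    """Count misses for an LRU cache with num_blocks blocks split into
    num_blocks // assoc sets."""
    num_sets = num_blocks // assoc
    sets = [OrderedDict() for _ in range(num_sets)]
    count = 0
    for addr in addresses:
        s, tag = sets[addr % num_sets], addr // num_sets
        if tag in s:
            s.move_to_end(tag)           # refresh LRU order on a hit
        else:
            count += 1
            if len(s) == assoc:
                s.popitem(last=False)    # evict the least recently used
            s[tag] = True
    return count

refs = [0, 8, 0, 6, 8]
assert misses(refs, 4, 1) == 5   # direct-mapped
assert misses(refs, 4, 2) == 4   # 2-way set-associative
assert misses(refs, 4, 4) == 3   # fully associative
```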

3. Which of the following techniques can resolve control hazard


(1) Branch prediction
(2) Stall
(3) Delayed branch

Answer: all three techniques can resolve control hazards.


(1) Execution of the branch instruction continues in the pipeline by
predicting whether the branch will be taken or not. If the prediction is wrong,
the instructions that are being fetched and decoded are discarded
(flushed).
(2) Pipeline is stalled until the branch is complete. The penalty will be
several clock cycles.
(3) A delayed branch always executes the following instruction. Compilers
and assemblers try to place an instruction that does not affect the branch
after the branch in the branch delay slot.

4. Write a C program which exhibits the temporal and spatial localities. The C
program cannot exceed 5 lines.

Answer:
void clear1(int array[], int size)
{
    int i;
    for (i = 0; i < size; i += 1)
        array[i] = 0;
}
/* Sequential accesses to array[] exhibit spatial locality; the repeated
   references to i and size on every iteration exhibit temporal locality. */

95 年成大資工

1. The following program tries to copy words from the address in register $a0 to the
address in register $a1 and count the number of words copied in register $v0. The
program stops copying when it finds a word equal to 0. You do not have to
preserve the contents of registers $v1, $a0, and $a1. This terminating word should
be copied but not counted.
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a0)
addi $a0, $a0, 1
addi $a1, $a1, 1
bne $v1, $zero, loop
There are multiple bugs in this MIPS program. Please fix them and turn in
bug-free version.

Answer:
addi $v0, $zero, -1
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 4
addi $a1, $a1, 4
bne $v1, $zero, Loop

2. (a) Fill in the following table using the index provided in the keywords (1)-(5) to
determine the 3-bit Booth algorithm. Assume that you have both the
multiplicand and twice the multiplicand already in registers.
(1) None (2) Add the multiplicand (3) Add twice the multiplicand (4) Subtract
the multiplicand (5) Subtract twice the multiplicand.
(b) Assume x is 010101two and y is 011011two. Please use the 2-bit and 3-bit Booth
algorithm to do the y*x operation.
(c) Will the 3-bit Booth algorithm always have a fewer operations than the 2-bit
Booth algorithm? Justify your answer by a brief description.
Current bits (ai+1 ai)   Previous bit (ai-1)   Operation
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

Answer:
(a)
Current bits (ai+1 ai)   Previous bit (ai-1)   Operation
0 0 0 (1)
0 0 1 (2)
0 1 0 (2)
0 1 1 (3)
1 0 0 (5)
1 0 1 (4)
1 1 0 (4)
1 1 1 (1)

(b) 2-bit
Iteration Step Multiplicand Product
0 initial values 011011 000000 010101 0
10 – Multiplicand 011011 100101 010101 0
1
Shift right product 011011 110010 101010 1
01 + Multiplicand 011011 001101 101010 1
2
Shift right product 011011 000110 110101 0
10 – Multiplicand 011011 101011 110101 0
3
Shift right product 011011 110101 111010 1
01 + Multiplicand 011011 010000 111010 1
4
Shift right product 011011 001000 011101 0
10 – Multiplicand 011011 101101 011101 0
5
Shift right product 011011 110110 101110 1
01 + Multiplicand 011011 010001 101110 1
6
Shift right product 011011 001000 110111 0

3-bit
Iteration Step Multiplicand Product
0 initial values 011011 000000 010101 0
010 + Multiplicand 011011 011011 010101 0
1
Shift right product 011011 000110 110101 0
010 + Multiplicand 011011 100001 110101 0
2
Shift right product 011011 001000 011101 0
010 + Multiplicand 011011 100011 011101 0
3
Shift right product 011011 001000 110111 1
(c) No. The 3-bit version does not always perform fewer operations than the 2-bit
version. For example, for 010101 × 100110, the 3-bit and 2-bit versions both
perform 3 add/subtract operations.
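The 2-bit (radix-2) recoding can be simulated directly on a double-width product register, as in the iteration tables above (an illustrative sketch; operands are passed as signed Python integers):

```python
def booth2(multiplicand, multiplier, bits=6):
    """Radix-2 Booth multiplication of two signed `bits`-wide operands,
    using a double-width product register as in the tables above."""
    width = 2 * bits
    product = multiplier & ((1 << bits) - 1)    # low half holds the multiplier
    prev = 0
    for _ in range(bits):
        cur = product & 1
        if (cur, prev) == (1, 0):
            product += (-multiplicand) << bits  # start of a run: subtract
        elif (cur, prev) == (0, 1):
            product += multiplicand << bits     # end of a run: add
        prev = cur
        product &= (1 << width) - 1
        sign = product >> (width - 1)
        product = (product >> 1) | (sign << (width - 1))  # arithmetic shift
    return product - (1 << width) if product >> (width - 1) else product

assert booth2(0b010101, 0b011011) == 21 * 27   # part (b): y * x = 567
assert booth2(-9, -13, bits=5) == 117
assert booth2(3, -5, bits=4) == -15
```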

3. Define zero, de-normalized number, floating point number, infinity, and NaN
(Not a number) in IEEE 754 double precision format by giving the range of their
exponents and significands, respectively. Fill in your answer in the following
table.
zero de-normalized floating point infinity NaN
Exponent
significand

Answer:
             zero   de-normalized   floating point   infinity   NaN
Exponent     0      0               1 - 2046         2047       2047
Significand  0      nonzero         anything         0          nonzero

4. Consider a CPU with following instructions:


Instruction Example Meaning
add add $1, $2, $3 $1 = $2 + $3
sub sub $1, $2, $3 $1 = $2 - $3
There are five pipeline stages:
(1) IF-Instruction fetch
(2) ID-Instruction decode and register fetch
(3) EX-Execution or calculate effective address
(4) MEM-Access data memory
(5) WB-Write back to registers
Now, consider a program segment S:
add $1, $2, $3
sub $4, $1, $5
add $6, $1, $7
add $8, $4, $1
sub $2, $7, $9
(a) If we stall the pipeline when there is a data hazard (no forwarding), how many
cycles will it take to complete this program segment. Draw the resulting
pipeline.
(b) Is it possible to produce the same result in few cycles by reordering
instruction? If, so show the reordering, depict the new pipeline and indicate
how many cycles it will take to complete this program segment S.

Answer:
(a) Suppose that register read and write can happen in the same clock cycle.
Therefore, we should stall 2 cycles between lines 1 and 2, and stall 1 cycle
between lines 2 and 4.
The total cycles to complete the segment = (5 – 1) + 5 + 3 = 12 clock cycles

1 2 3 4 5 6 7 8 9 10 11 12
add IF ID EX MEM WB
sub IF ID ID ID EX MEM WB
add IF IF IF ID EX MEM WB
add IF ID ID EX MEM WB
sub IF IF ID EX MEM WB

(b) add $1, $2, $3


sub $2, $7, $9
sub $4, $1, $5
add $6, $1, $7
add $8, $4, $1
1 2 3 4 5 6 7 8 9 10 11
add IF ID EX MEM WB
sub IF ID EX MEM WB
sub IF ID ID EX MEM WB
add IF IF ID EX MEM WB
add IF ID ID EX MEM WB
The total cycles to complete the segment = (5 – 1) + 5 + 2 = 11 clock cycles

94 年成大資工

1. Use Booth algorithm to calculate the following:


(a) multiplicand × multiplier = 10111 × 10011 = -9 × -13 = 117
Iteration Step Multiplicand Product
00000 100110
0 initial values 10111
-10111
1 10111
2 10111
3 10111
4 10111
5 10111 00011 10101 1
(b) Prove the correctness of the Booth algorithm. The main idea of the Booth
algorithm is that a sequence of k 1's (k additions) is replaced by one addition
and one subtraction. You must explain why only one subtraction (without a
matching addition) is needed for the last sequence of 1's when the multiplier is
negative.

Answer: (a)
Iteration Step Multiplicand Product
0 initial values 10111 00000 100110
10 prod = prod - Mcand 10111 01001 100110
1
Shift right product 10111 00100 110011
11 no operation 10111 00100 110011
2
Shift right product 10111 00010 011001
01 prod = prod + Mcand 10111 11001 011001
3
Shift right product 10111 11100 101100
00 no operation 10111 11100 101100
4
Shift right product 10111 11110 010110
10 prod = prod - Mcand 10111 00111 010110
5
Shift right product 10111 00011 10101 1
(b) Suppose a is the multiplier, b is the multiplicand, and ai is the i-th bit of a
(with a-1 = 0). Booth's algorithm recodes the multiplier bits as follows:

ai  ai-1   Operation     ai-1 - ai
0   0      Do nothing     0
0   1      Add b         +1
1   0      Subtract b    -1
1   1      Do nothing     0

so it implements the computation

(a-1 - a0)·b·2^0 + (a0 - a1)·b·2^1 + (a1 - a2)·b·2^2 + ... + (a29 - a30)·b·2^30 + (a30 - a31)·b·2^31
= b × (-a31·2^31 + a30·2^30 + a29·2^29 + ... + a1·2^1 + a0·2^0)
= b × a

If the multiplier is negative, the leftmost bit of its final run of 1s is the most
significant bit, so that run is never closed by a matching addition; only the single
subtraction at the start of the run is performed.

2. What is the biased single precision IEEE 754 floating point format of 0.9375?
What is purpose to bias the exponent of the floating point numbers?

Answer:
(1) 0.9375ten = 0.1111two = 1.111two × 2^-1

S  E         F
0  01111110  11100000000000000000000

(2) Most floating-point operations must first compare exponents to decide which
operand's significand to align. With the biased representation, the exponent
fields can be compared directly as unsigned numbers, without separately
handling the exponent's sign, so the comparison is faster.
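The encoding can be confirmed by packing the value and inspecting the bit fields (an illustrative sketch using Python's struct module):

```python
import struct

# Encode 0.9375 as IEEE 754 single precision and split out the fields.
bits = struct.unpack(">I", struct.pack(">f", 0.9375))[0]
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

assert sign == 0
assert exponent == 0b01111110          # 126 = -1 + bias of 127
assert fraction == 0b11100000000000000000000
```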

3. Why do the ripple carry adders perform additions in a sequential manner?


Carry-lookahead adder is one of the fast-carry schemes to improve the adder
performance over ripple carry adders. What is the principle of these fast-carry
schemes? Briefly explain.

Answer:
(1) 因為各位元之carry-in與carry-out是以串聯方式連接而且進位的傳遞是由
最小的位元逐一傳至最高位元。
(2) 任何位元之進位可由本級自行產生或經由傳遞前面任何一級位元所產生
之進位而得,因此各級進位不必像ripple carry adders一樣由最低位元傳
遞,所以速度較快。

4. Assume the following:


(1) k is the number of bits for a computer's address size (using byte addressing)
(2) S is cache size in bytes
(3) B is block size in bytes, B = 2b
(4) A stands for A-way associative cache
Figure out the following quantities in terms of S, B, A, and k:
(a) the number of sets in the cache
(b) the number of index bits in the address, and
(c) the number of bits needed to implement the cache

Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2b bytes/block
Associativity: A blocks/set

Number of sets in the cache = S/(A × B)
Number of bits for the index = log2(S/(A × B))
Number of bits for the tag = k - (log2(S/(A × B)) + b) = k - log2(S/A)
Number of bits needed to implement the cache
= sets/cache × associativity × (data + tag + valid)
= S/(A × B) × A × (8 × B + (k - log2(S/A)) + 1)
= (S/B) × (8 × B + k - log2(S/A) + 1) bits
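The closed-form bit count can be evaluated for a concrete configuration (an illustrative sketch; the example cache parameters are ours):

```python
import math

def cache_bits(S, B, A, k):
    """Total storage bits for an A-way cache of S bytes with B-byte blocks
    on a machine with k-bit byte addresses (data + tag + valid)."""
    sets = S // (A * B)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(B))
    tag_bits = k - index_bits - offset_bits
    return sets * A * (8 * B + tag_bits + 1)

# Example: a 16 KB direct-mapped cache with 16-byte blocks and 32-bit
# addresses has 1024 sets, a 10-bit index, a 4-bit offset, and an 18-bit tag.
assert cache_bits(16 * 1024, 16, 1, 32) == 1024 * (128 + 18 + 1)
```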

93 年成大資工

1. Given the following bit pattern:


(0100 0000 0010 1101 1111 1000 0100 1101)two
What decimal number does it represent? Assume that it is an IEEE 754 single
precision floating point number.

Answer:
(–1)sign × (1+significand) × 2exponent – bias
= (–1)0 × (1 + 2-2 + 2-4 + 2-5 + 2-7 + 2-8 + 2-9 + 2-10 + 2-11 + 2-12 + 2-17 + 2-20 + 2-21 +
2-23) × 21
= 2 + 2-1 + 2-3 + 2-4 + 2-6 + 2-7 + 2-8 + 2-9 + 2-10 + 2-11 + 2-16 + 2-19 + 2-20 +2-22

2. Show the minimal MIPS instruction sequence for a new instruction called not that
takes the one's complement of a Source register and places it in a Destination
register. Convert this instruction (accepted by the MIPS assembler): not $s0, $s1
(Hint: It can be done in one instruction if you use the new logical instruction)
Answer: nor $s0, $s1, $zero

3. Consider the following measurements made on a pair of SPARCstation 10s


running Solaris 2.3, connected to two different types of networks, and using
TCP/IP for communication:
Characteristic                       Ethernet     ATM
Bandwidth from node to network       1.25 MB/sec  10 MB/sec
Interconnect latency                 18 µs        42 µs
HW latency to/from network           5 µs         9 µs
SW overhead sending to network       198 µs       211 µs
SW overhead receiving from network   249 µs       356 µs
(HW: Hardware, SW: Software)
Find the host-to-host latency for a 250-byte message using each network.

Answer:
The transmission time (Ethernet) = 250 bytes / (1.25 × 10^6 bytes/sec) = 200 µs
The transmission time (ATM) = 250 bytes / (10 × 10^6 bytes/sec) = 25 µs
The total latency to send and receive the packet is the sum of the transmission
time and the hardware and software overheads:
Total time (Ethernet) = 198 + 5 + 18 + 5 + 249 + 200 = 675 µs
Total time (ATM) = 211 + 9 + 42 + 9 + 356 + 25 = 652 µs

4. Suppose there are a processor running at 1.5G Hz and a hard-disk. The hard disk
has a transfer rate of 8 MB/sec and uses DMA. Assume that the initial setup of a
DMA transfer takes 800 clock cycles for the processor, and assume the handling
of the interrupt at DMA completion requires 400 clock cycles for the processor. If
the average transfer from the disk is 16 KB, what fraction of this processor is
consumed if the disk is actively transferring 100% of the time? Ignore any impact
from bus contention between the processor and DMA controller.

Answer:
Each DMA transfer takes 16 KB / (8 MB/sec) = 2 × 10^-3 seconds. So if the disk is
constantly transferring, it requires (800 + 400) / (2 × 10^-3) = 600 × 10^3 clock
cycles/sec.
Fraction of processor consumed = (600 × 10^3) / (1.5 × 10^9) = 0.4 × 10^-3 = 0.04%

92 年成大資工

1. Since assembly language is the interface to higher-level software, the assembler


can also treat common variations of machine language instructions as if they were
instructions in their own right. However, these instructions need not be
implemented in hardware. Such instructions are called pseudoinstructions, and
many of them appear in MIPS programs.
For each pseudoinstruction in the following table, produce a minimal sequence of
actual MIPS instructions to accomplish the same thing. You may need to use $at
for some of the sequences. In the following table, big refers to a specific number
that requires 32 bits to represent and small to a number that can be expressed
using 16 bits.
Pseudoinstruction What it accomplishes Solution
move $t1, $t2 $t1 = $t2 ex: add $t1, $t2, $zero
clear $t0 $t0 = 0
beq $t1, small, L if ($t1 = small) go to L
beq $t2, big, L if ($t2 = big) go to L
li $t1, small $t1 = small
li $t2, big $t2 = big
ble $t3, $t5, L if ($t3 <= $t5) goto L
bgt $t4, $t5, L if ($t4 > $t5) go to L
bge $t5, $t3, L if ($t5 >= $t3) go to L
addi $t0, $t2, big $t0 = $t2 + big
lw $t5, big($t2) $t5 = Memory[$t2+big]

Answer:
Pseudoinstruction What it accomplishes Solution
move $t1, $t2 $t1 = $t2 add $t1, $t2, $zero
clear $t0 $t0 = 0 add $t0, $zero, $zero
li $at, small
beq $t1, small, L if ($t1 = small) go to L
beq $t1, $at, L
li $at, big
beq $t2, big, L if ($t2 = big) go to L
beq $at, $t2, L
li $t1, small $t1 = small addi $t1, $zero, small
lui $t2, upper(big)
li $t2, big $t2 = big
ori $t2, $t2, lower(big)
slt $at, $t5, $t3
ble $t3, $t5, L if ($t3 <= $t5) goto L
beq $at, $zero, L
slt $at, $t5, $t4
bgt $t4, $t5, L if ($t4 > $t5) go to L
bne $at, $zero, L
slt $at, $t5, $t3
bge $t5, $t3, L if ($t5 >= $t3) go to L
beq $at, $zero, L

li $at, big
addi $t0, $t2, big $t0 = $t2 + big
add $t0, $t2, $at
li $at, big
lw $t5, big($t2) $t5 = Memory[$t2+big] add $at, $at, $t2
lw $t5, 0($at)

2. Suppose we add an addressing mode to MIPS that allows arithmetic instructions to


access memory directly. If we add an instruction, addm, as is found in the 80x86,
with the following brief description:
addm $t2, 100($t3) # $t2 = $t2 + Memory[$t3 + 100]
then please describe the steps this instruction 'addm' might take. Then write a
paragraph or two explaining why it would be hard to add this instruction to the
MIPS pipeline. (Hint: You may have to add one or more additional stages to the
pipeline.)

Answer:
(1) The steps this instruction 'addm' might take:
Step1: instruction fetch
Step2: instruction decode and register fetch
Step3: memory address calculation
Step4: memory access
Step5: execution
Step6: write back
(2) Adding the addm instruction would increase the number of MIPS pipeline
stages from 5 to 6. With no increase in clock rate, this lowers the pipeline's
performance. With more stages, more hardware is also required (stage 3 needs
an extra adder to compute the memory address), and the penalties for
resolving hazards increase as well.

3. Here are two different I/O systems intended for use in transaction processing:
System A can support 1500 I/O operations per second.
System B can support 1000 I/O operations per second.
The systems use the same processor that executes 500 million instructions per
second. The latency of an I/O operation for these two systems differs. The latency
for an I/O on system A is equal to 20 ms, while for system B the latency is 18 ms
for the first 500 I/Os per second and 25 ms per I/O for each I/O between 500 and
1000 I/Os per second. In the workload, every 10th transaction depends on the
immediately preceding transaction and must wait for its completion. What is the
maximum transaction rate that still allows every transaction to complete in 1
second and that does not exceed the I/O bandwidth of the machine? (Assume that
each transaction requires 5 I/O operations and that each I/O operation requires
10,000 instructions. And for simplicity, assume that all transaction requests arrive
at the beginning of a 1-second interval.)

Answer:
System A
First 9 transactions: 45 I/Os × 20 ms = 900 ms
Dependent 10th transaction: compute 100 µs, then 5 I/Os × 20 ms = 100 ms
Total ≈ 1000.1 ms, which exceeds 1 second.
Thus system A can only support 9 transactions per second.

System B, first 500 I/Os (first 100 transactions), 18 ms per I/O
First 9 transactions: 45 I/Os × 18 ms = 810 ms
Dependent 10th transaction: compute 100 µs, then 5 I/Os × 18 ms = 90 ms
Dependent 11th transaction: compute 100 µs, then 5 I/Os × 18 ms = 90 ms
Total ≈ 990.2 ms, which fits within 1 second.
Thus system B can support 11 transactions per second.

(Compute time per transaction = 5 I/Os × 10,000 instructions / 500 MIPS = 100 µs.)

96 年成大電機

1. Given the following MIPS instruction code segment, please answer each question
below.
16 L1: addi $t0, $t0, 4
20 lw $s1, 0($t0)
24 sw $s1, 32($t0)
28 lw $t1, 64($t0)
32 slt $s0, $t1, $zero
36 bne $s0, $zero, L1
(a) Given a pipeline processor which has 5 stages: IF, ID, EX, ME, WB. Assume
no forwarding unit is available. There are hazards in the code, please detect
the hazards and point out where to insert no-ops (or bubbles) to make the
pipeline datapath execute the code correctly. You don't need to rewrite the
entire code segment. You can simply indicate the location where you would
insert the no-ops. For example, if you want to insert 6 no-ops between the
instruction addi at address 16 and lw at address 20, you can state something
like "6 no-ops between 16 and 20".
(b) Assume a forwarding unit is available to only forward data from ME and/or
WB to EX. Please reorder/rewrite the code to maximize its performance. Note
that you should consider maximizing the performance based on the
assumption that the loop might be iterated a few times. You may insert no-ops
in the code segment to resolve inevitable hazards if any.

Answer:
(a) 2 no-ops between 16 and 20
2 no-ops between 20 and 24
2 no-ops between 28 and 32
2 no-ops between 32 and 36
1 no-ops behind 36
(b)
L1: addi $t0, $t0, 4
lw $t1, 64($t0)
lw $s1, 0($t0)
slt $s0, $t1, $zero
nop
nop
bne $s0, $zero, L1
sw $s1, 32($t0)

2. Assume you are asked to design the architecture of the memory hierarchy for a
computer with a 32-bit 4 GHz MIPS processor. The processor has a 64 KB 1st
level cache and a 256 KB 2nd level cache on chip. The 1st level cache is 2-way
associative and the 2nd level cache is 8-way associative. Assume the word size is
32 bits and the block size for both caches is 8 bytes. Assume both caches are
virtually addressed. The size of the physical memory is 2 GB. The memory space
is byte-addressing. Based on the given information, please answer the following
questions.
(a) Please locate virtual address 0x0000 ABCD in both caches. That is, show
which set the address will be if it's in the 1st level cache and 2nd level cache,
respectively.
(b) Suppose the update policy of the 1st level cache is write allocate, write back,
and LRU replacement. Execute each of the following instruction and indicate
whether it's a hit or a miss for (1) to (5) on 1st level cache. (Assume initially
the content of $s0 = 0x0000 0000, $s1 = 0xFEDC 0000, $s2 = 0x8000 0000,
and both caches are empty.)
Instruction Cache hit or miss
lb $t0, 0x001F($s0) miss
lb $t1, 0x801D($s1) (1)
lb $t2, 0x0018($s1) (2)
sb $t1, 0x0018($s0) (3)
lb $t0, 0x001C($s2) (4)
sb $t0, 0x001A($s1) (5)
Finally, after executing the piece of codes, has the memory been updated?
Please answer yes or no.
(c) Suppose the access time to main memory with 2nd level cache disabled is
100ns, including all the miss handling. Suppose the base CPI of the processor
is 2, assuming all references hit in the 1st level cache. Further assume the test
program you use to test the memory hierarchy has a 4% miss rate per
instruction for 1st level cache. Now with 2nd level cache enabled, the test
program has a miss rate of 0.2%. Suppose the access time of 2nd level cache is
20ns for either hit or miss. How much performance improvement you will get
with the 2nd level cache enabled?

Answer:

(a)
                 L1 cache                      L2 cache
Cache size       64 KB                         256 KB
Mapping          2-way                         8-way
Block size       8 B                           8 B
# of sets        64 KB / (8 B × 2) = 4K        256 KB / (8 B × 8) = 4K
Address format   Tag 17 | Set 12 | Offset 3    Tag 17 | Set 12 | Offset 3

Since 0x0000ABCD = 0000 0000 0000 0000 1010 1011 1100 1101two, the set
index (bits 14-3) for both the L1 and the L2 cache is 010101111001two.
(b)
Instruction           Address     Tag (bits 31-15)    Index (bits 14-3)  Offset  Hit/miss
lb $t0, 0x001F($s0)   0x0000001F  00000000000000000   000000000011       111     Miss
lb $t1, 0x801D($s1)   0xFEDC801D  11111110110111001   000000000011       101     Miss
lb $t2, 0x0018($s1)   0xFEDC0018  11111110110111000   000000000011       000     Miss
sb $t1, 0x0018($s0)   0x00000018  00000000000000000   000000000011       000     Miss
lb $t0, 0x001C($s2)   0x8000001C  10000000000000000   000000000011       100     Miss
sb $t0, 0x001A($s1)   0xFEDC001A  11111110110111000   000000000011       010     Miss

No, memory has not been updated, since write-back strategy is used.
(c)
The miss penalty to main memory is 100 / 0.25 = 400 clock cycles
For the processor with one level of caching CPI = 2.0 + 400 × 4% = 18
The miss penalty for an access to the second-level cache is 20 / 0.25 = 80
clock cycles
For the two-level cache, total CPI = 2.0 + 4% × 80 + 0.2% × 400 = 6
Thus, the processor with the secondary cache is faster by 18/6 = 3
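The CPI arithmetic in (c) is reproduced below (an illustrative sketch):

```python
cycle_ns = 1 / 4                    # 4 GHz clock, so 0.25 ns per cycle
base_cpi = 2.0
main_mem_penalty = 100 / cycle_ns   # 400 cycles to main memory
l2_access = 20 / cycle_ns           # 80 cycles to the L2 cache

cpi_l1_only = base_cpi + 0.04 * main_mem_penalty
cpi_with_l2 = base_cpi + 0.04 * l2_access + 0.002 * main_mem_penalty
assert abs(cpi_l1_only - 18.0) < 1e-9
assert abs(cpi_with_l2 - 6.0) < 1e-9
assert abs(cpi_l1_only / cpi_with_l2 - 3.0) < 1e-9   # speedup from the L2
```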

3. True or false:
(a) In processor implementation, single-cycle implementation is not as good as
multi-cycle implementation because single-cycle implementation tends to
have a longer clock cycle and higher CPI than multi-cycle implementation.
(b) Thrashing occurs if a program constantly accesses more virtual memory than
it has physical memory, causing continuous swapping between memory and
disk.
(c) RAID 3,4, and 5 all have the capability of performing parallel reads and
writes.
(d) Suppose a program runs in 60 seconds on a machine, with multiplication
responsible for 40 seconds of the time. According to Amdahl's law, we can
simply improve the speed of multiplication to have the program run at 3 times
faster.
(e) The idea of using two levels of cache is that 1st level cache is to minimize the
cache miss ratio and the 2nd level cache is to reduce the cache hit time.

Answer:
(a) False, single-cycle implementation has a lower CPI than multi-cycle
implementation.
(b) True
(c) False. For small accesses, it is true that RAID 4 and 5 can perform
parallel reads and writes. RAID 3, however, serves only one request at a
time, regardless of whether the access is small or large.
(d) False. Suppose x is the speedup of multiplication; then 60/3 = 20 + 40/x ⇒ x = ∞,
so a 3× overall speedup cannot be reached by improving multiplication alone.
(e) False, 1st level cache is to minimize the cache hit time and the 2nd level cache
is to reduce the cache miss ratio
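Statement (d) in particular can be checked numerically: even an unboundedly fast multiplier cannot push the runtime below the 20 s spent outside multiplication.

```python
# Amdahl's law check for statement (d): total = 60 s, multiplication = 40 s.
def runtime(x):
    """Runtime in seconds when multiplication is sped up by a factor of x."""
    return (60 - 40) + 40 / x

# Even a million-fold faster multiplier cannot reach the 20 s target (60/3):
# the 40/x term would have to be exactly zero, i.e. x = infinity.
print(runtime(1), runtime(10), runtime(1_000_000))
```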

4. Design a direct memory access (DMA) controller in a multi-master bus-based


system.
(a) Show a generic design that can be used for transferring data between the main
memory and the I/O. Specify the functionality of the registers used and the
interface signals of the DMA controller.
(b) Using the interface signals, elaborate the DMA operations that transfer a
block of data from memory to the I/O.

Answer:
(a)

[Figure: DMA controller block diagram. The DMA controller contains an Address
Register, a Data Register, a Data Count register, a Device Number register,
and Control logic. It connects to the memory over the data bus and to the I/O
devices over the I/O bus, and exchanges the DMA Request, DMA Acknowledge, and
Interrupt signals with the CPU.]
Data count: indicates the amount of data to be transferred
Data register: buffers the data to/from memory
Address register: indicates which memory address is to be accessed
DMA request: asks the CPU for a DMA transfer
DMA acknowledge: the CPU's grant of a DMA request
Interrupt: signals the CPU when the DMA controller needs attention
(b)
1. The DMA controller asks the CPU for a DMA transfer via DMA request
2. The CPU responds to the DMA request via DMA acknowledge
3. The CPU initializes the DMA controller, telling it:
   - read or write
   - the device address
   - the starting address of the memory block for the data
   - the amount of data to be transferred
4. The CPU carries on with other work
5. The DMA controller performs the transfer
6. The DMA controller raises Interrupt when finished

5. Fill in the appropriate term or terminology for the underlined fields:
(a) move $s1, $zero = addi __, __, __
(b) CPU execution time = Instruction count × _________ × clock cycle time.
(c) After a silicon ingot is sliced, it is called a ________.
(d) For a 32-bit register, if the least significant byte (B0) is stored at memory
address 4N where N is an integer ≥ 0, this storage order is called ____
endian.
(e) For a 32-bit register, if the least significant byte (B0) is stored at memory
address 4N + 3 where N is an integer ≥ 0, this storage order is called ____
endian.

Answer:
(a) $s1, $zero, 0
(b) CPI (cycles per instruction)
(c) blank wafer
(d) little
(e) big

Year 95 NCKU EE

1. For the pipeline processor shown below, the following sequence of instructions
causes the pipeline hazard due to load-use dependency.
lw $4, 100($2)
add $8, $4, $4
Assuming the lw instruction will take 2 data memory cycles to get the data from
the memory and a forwarding circuit is employed, detail the design of the hazard
detection unit for this processor assuming MIPS-like ISA is used. Sketch your
design in the processor pipeline diagram and explain the signals you use. Write
down the behavioral code for the logic of the hazard detection unit.
[Figure: six-stage pipeline — PC/IM (IF), RF (ID), EX, DM1 (MEM1), DM2 (MEM2),
WB — with pipeline registers IF/ID, ID/EX, EX/MEM1, MEM1/MEM2, MEM2/WB and a
Forwarding Unit.]

Answer:
Since lw will take 2 memory cycles to get the data, we need to stall the pipeline
for 2 clock cycles to solve the load-use data hazard.
[Figure: the same pipeline (registers IF/ID, ID/EX, EX/MEM1, EX/MEM2, MEM2/WB;
units PC/IM, RF, DM1, DM2, Forwarding Unit) with a Hazard Detection unit added
in the ID stage; it drives PCWrite, IF/IDWrite, and the control-signal clear to
insert bubbles.]
Signals:
IF/ID.RegisterRs, IF/ID.RegisterRt, ID/EX.RegisterRt
ControlSignalClear
ID/EX.MemRead, EX/MEM1.MemRead, EX/MEM1.RegisterRd
PCWrite, IF/IDWrite

Behavioral code:
IF ((ID/EX.MemRead) and
((ID/EX.RegisterRt = IF/ID.RegisterRs) or

(ID/EX.RegisterRt = IF/ID.RegisterRt)))
stall the pipeline
IF ((EX/MEM1.MemRead) and
((EX/MEM1.RegisterRd = IF/ID.RegisterRs) or
(EX/MEM1.RegisterRd = IF/ID.RegisterRt)))
stall the pipeline
Note: regardless of whether the destination register number comes from the
instruction's Rd or Rt field, it is referred to as Rd after it passes the EX stage.

2. For the above pipelined processor, come up with the behavioral code for the logic
of the forwarding unit assuming MIPS-like ISA is used. State you assumptions if
any.

Answer:
EX hazard:
IF (EX/MEM1.RegWrite
    and (EX/MEM1.RegisterRd ≠ 0)
    and (EX/MEM1.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
IF (EX/MEM1.RegWrite
    and (EX/MEM1.RegisterRd ≠ 0)
    and (EX/MEM1.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
MEM1 hazard:
IF (EX/MEM2.RegWrite
    and (EX/MEM2.RegisterRd ≠ 0)
    and not (EX/MEM1.RegWrite
             and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRs))
    and (EX/MEM2.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
IF (EX/MEM2.RegWrite
    and (EX/MEM2.RegisterRd ≠ 0)
    and not (EX/MEM1.RegWrite
             and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRt))
    and (EX/MEM2.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM2 hazard:
IF (MEM2/WB.RegWrite
    and (MEM2/WB.RegisterRd ≠ 0)
    and not (EX/MEM2.RegWrite
             and (EX/MEM2.RegisterRd ≠ 0)
             and (EX/MEM2.RegisterRd = ID/EX.RegisterRs))
    and not (EX/MEM1.RegWrite
             and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRs))
    and (MEM2/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 11
IF (MEM2/WB.RegWrite
    and (MEM2/WB.RegisterRd ≠ 0)
    and not (EX/MEM2.RegWrite
             and (EX/MEM2.RegisterRd ≠ 0)
             and (EX/MEM2.RegisterRd = ID/EX.RegisterRt))
    and not (EX/MEM1.RegWrite
             and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRt))
    and (MEM2/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 11

3. Assume you are asked to design the architecture of the memory hierarchy for a
computer which has a 32-bit MIPS processor with a clock rate of 2 GHz. The
processor has a 32 KB(Kilo-Byte) 1st level cache and a 256 KB 2nd level cache on
chip. The 1s1 level cache is 4-way associative and the 2nd level cache is fully
associative. Assume the word size is 32 bits and the block size for both caches is
32 bytes. The size of the physical memory is 2 GB(Giga-Byte). The memory
space is byte-addressing. Based on the given information, please answer the
following questions.
(1) How many bits are needed for each of the fields in the following structure to
index 1st level cache and the 2nd level cache, respectively? Note: show the
answers for 1st level cache and 2nd level cache separately.
Tag Index Block Offset

(2) Suppose the access time to main memory with 2nd level cache disabled is
250ns. That is, the access time includes 1st level miss handling. Suppose the
base CPI of the processor is 2, assuming all references hit in the 1st level
cache. Further assume the test program you use to test the memory hierarchy
has a 3% miss rate per instruction for 1st level cache. Now with 2nd level
cache enabled, the test program has a miss rate of 0.2%. Suppose the access
time of 2nd level cache is 20ns for either a hit or a miss. How much
performance improvement will you get with the 2nd level cache enabled?
(3) Suppose this computer has a 32-bit virtual address space and 4 KB page size.
 How many virtual pages are there?
 How many physical pages are there?
 Assume each entry in a page table consume 1 word, what is the size of the
page table in bytes?
(4) Following the specification in (3), given the page table in the following,
please derive the physical address of the virtual address 0x00001004, and then
locate the address in the 1st level cache. That is, show which set the address
will be if it's in 1st level cache.
Page Entry no Valid Dirty Ref Physical page address
0 1 1 1 0x0001 1000
1 1 0 0 0x0004 1000
2 1 0 0 0x0001 2000
3 1 1 1 0x0003 3000
4 1 0 1 0x000F E000
∙∙∙ ∙∙∙∙ ∙∙∙∙ ∙∙∙∙ ∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙

Answer:
(1) Level-one cache
Tag Index Block Offset
18 8 3

 Level-two cache
Tag Index Block Offset
26 0 3
Note: since there are 32/4 = 8 words in a block, the block offset = log₂8 = 3 bits
(2) CPI with the 2nd-level cache disabled = 2 + 0.03 × 250/0.5 = 17
    CPI with the 2nd-level cache enabled = 2 + 0.03 × 20/0.5 + 0.002 × 250/0.5 = 4.2
    The performance improvement = 17/4.2 = 4.05 times
(3) 2³²/4K = 2²⁰ = 1M virtual pages
    2G/4K = 2³¹/2¹² = 2¹⁹ = 0.5M physical pages
    1M × 4 bytes = 4 MB
(4) virtual address = 0x00001004 = 0000 0000 0000 0000 0001 0000 0000 0100₂
    virtual page no. = 0000 0000 0000 0000 0001₂
    page offset = 0000 0000 0100₂
    Looking up page entry no. 1, the physical page address is
    0x0004 1000, whose page-number part is 0000 0000 0000 0100 0001₂
    The physical address = 0000 0000 0000 0100 0001 0000 0000 0100₂
    Tag                 Index     Block Offset  Byte Offset
    18 bits             8 bits    3 bits        2 bits
    000000000000100000  10000000  001           00
    So the virtual address maps to cache set number 128
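Parts (2) and (4) both reduce to short computations, which the sketch below redoes (the page size, cache geometry, and the page-table entry for virtual page 1 are taken from the problem statement):

```python
# Part (2): CPI with the 0.5 ns cycle of the 2 GHz clock.
cycle = 0.5
cpi_off = 2 + 0.03 * 250 / cycle                       # 2nd-level cache disabled
cpi_on = 2 + 0.03 * 20 / cycle + 0.002 * 250 / cycle   # 2nd-level cache enabled

# Part (4): translate VA 0x00001004 with 4 KB pages, then find the L1 set.
PAGE = 4096
page_table = {1: 0x00041000}          # entry no. 1 from the table (frame base)
va = 0x00001004
pa = page_table[va // PAGE] | (va % PAGE)

BLOCK, WAYS, SIZE = 32, 4, 32 * 1024  # 32 KB, 4-way, 32-byte blocks
n_sets = SIZE // (BLOCK * WAYS)       # 256 sets
l1_set = (pa // BLOCK) % n_sets       # set index of the physical address
print(cpi_off, cpi_on, hex(pa), l1_set)
```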

4. Suppose you run photoshop to load a 4 MB(Mega-Byte) image file from the hard
disk to the memory for editing. Unfortunately, your disk is so fragmented that all
data blocks associated with this file is scattered around the disk randomly. The
parameters of the disk are listed below.
Average seek time: 12 milli-second
Rotational speed: 5000 RPM(rotation per minute)
Block size: 512 bytes
Transfer rate: 0.4 MB/sec
Ignore all other overheads. How long does the photoshop program need to wait
for the file transfer to finish from the hard disk to the memory?

Answer:
We have 4 MB/0.5 KB = 8K blocks to load.
Moving one block from the disk requires:
12 ms (seek) + 0.5 × (60/5000) s (rotational latency) + 0.5 KB/(0.4 MB/s) (transfer)
= 12 ms + 6 ms + 1.25 ms = 19.25 ms
The total time to load the file: 8K × 19.25 ms ≈ 157.7 s
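The same sum can be verified directly (a sketch; like the solution, it uses K = 1000 in the transfer term, which is what yields the 1.25 ms figure, and 8K = 8192 blocks):

```python
# Disk-transfer check: 4 MB in scattered 512-byte blocks.
n_blocks = 4 * 2**20 // 512                  # 8192 blocks

seek = 12e-3                                 # average seek time, seconds
rotation = 0.5 * 60 / 5000                   # half a revolution at 5000 RPM
transfer = 0.5e3 / 0.4e6                     # 0.5 KB at 0.4 MB/s (K = 1000 here)

per_block = seek + rotation + transfer       # 19.25 ms
total = n_blocks * per_block                 # ~157.7 s
print(per_block * 1e3, total)
```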

Year 94 NCKU EE

1. Which of the following is (are) true?


(a) For a fixed size cache memory, the larger the line size is the smaller the tag
memory the cache uses.
(b) For a fixed size cache memory, the larger the line size is the larger the tag
memory the cache uses.
(c) For a direct-mapped cache, no address tag is the same in the tag memory.
(d) For a two-way associative cache, no address tag is the same in the tag
memory.

Answer: (a)

2. Which of the following is (are) true for a 64KB cache with a line size of 32 bytes?
Assume that the cachable memory is 1 GB.
(a) In a direct-mapped implementation, the tag length is 16 bits; the index field is
11 bits in length.
(b) In a direct-mapped implementation, the tag length is 14 bits; the index field is
16 bits in length.
(c) In a direct-mapped implementation, the tag length is 14 bits; the field
determining line size is 5 bits in length.
(d) In a two-way implementation, the tag length is 15 bits; the index field is 10
bits in length.

Answer: (c)(d)
3. Which of the following is (are) true?
(a) A non-blocking cache allows hit under miss to hide miss latency.
(b) A non-blocking cache does not allow miss under hit to hide miss latency.
(c) Miss under miss allows multiple outstanding cache misses.
(d) A non-blocking cache allows a load instruction to access the cache if the
previous load is a cache miss.

Answer: (a)(b)(c)(d)

4. Which of the following is (are) true for the forwarding unit in a 5-stage pipelined
processor?
(a) The forwarding unit is used to detect the instruction cache stalling.
(b) The forwarding unit is a combinational circuit which detects the true data
dependency for EXE pipeline stage and selects the forwarded results for the
execution unit
(c) The forwarding unit is a pipeline register which detects the true data
dependency for EXE pipeline stage and selects the forwarded results for the

execution unit.
(d) The forwarding unit compares the source register number of the instructions
in the MEM and WB stages with the destination register number of the
instruction in the decode stage.

Answer: (b)

5. Which of the following is (are) not true?


(a) A control hazard is the delay in determining the proper data to load in the
MEM stage of a pipeline processor.
(b) A load-use data hazard occurs because the pipeline flushes the instructions
behind.
(c) To flush instructions in the pipeline means to load the pipeline with the
requested instructions using the predicted PC.
(d) A branch prediction buffer is a buffer that the compiler uses to predict a
branch.

Answer: (a), (b), (c), (d)

6. Which of the following is (are) true for the combinations of events in the TLB,
virtual memory system, and cache?
(a) It is possible that an access results in a TLB hit, a page table hit, and a cache
miss.
(b) It is possible that an access results in a TLB hit, a page table miss, and a cache
miss.
(c) It is possible that an access results in a TLB hit, a page table miss, and a cache
hit.
(d) It is possible that an access results in a TLB miss, a page table hit, and a cache
miss.

Answer: (a), (d)

7. Which of the following is (are) true?


(a) Virtual memory technique treats the main memory as a fully-set associative
write-back cache.
(b) Virtual address must be always larger than the physical address.
(c) TLB can be seen as the cache of a page table.
(d) If the valid bit for a virtual address is off, a page fault occurs.

Answer: (a), (c), (d)

8. Which of the following is (are) true?
(a) Memory-mapped I/O is an I/O scheme in which special designed I/O
instructions are used to access the memory space.
(b) The process of periodically checking status bits to see if it is time for the next
I/O operation is called interrupt.
(c) DMA is a mechanism that provides a device controller the ability to transfer
data directly to or from memory without involving the processor. DMA is
also a bus master.
(d) In a cache-based system, because of the coherence problem, thus DMA can
not be used.

Answer: (c)

9. Which of the following is (are) true?


(a) Computers have been built in the same, old-fashioned way for far too long,
and this antiquated model of computation is running out of steam.
(b) Dynamic power = Capacitive load × Voltage2 × Frequency switched
(c) Static power is due to the small operating current in CMOS.
(d) Yield = the percentage of good dies from the total number of dies on the
wafer.

Answer: (b), (d)

10. Which of the following is (are) true?


(a) ISA (instruction set architecture) is an abstraction which is the interface
between the hardware and the low-level software (assembly instructions).
This abstract interface enables different implementations of the same ISA to
run identical software.
(b) A caller is the program that is called by the procedure which gives the call.
(c) A basic block is a sequence of instructions with branch at the beginning and at
the end.
(d) A register file is a large memory for storing files

Answer: (a)

11. Which of the following statements confirming to the design principle: simplicity
favors regularity?
(a) Keeping all instructions in a single size.
(b) Always requiring three operands in arithmetic instruction
(c) Keeping the register fields in the same place in each instruction format.
(d) Having the same opcode field in the same place in each instruction format.

Answer: (a), (b), (c)

12. Which of the following is (are) true?
(a) Page fault is signaled by software.
(b) TLB exception can only be handled in hardware.
(c) A cache miss is handled in hardware.
(d) A page fault is handled in software.

Answer: (a), (c), (d)


Note: a TLB miss can be handled either in hardware or in software.

13. Which of the following is (are) true?


(a) When a cache write hit occurs, the written data are also updated in the next
level of memory. This is the write-through policy.
(b) There is no cache coherency problem for the write-through cache since the
data are written into the next level of memory.
(c) When a cache write hit occurs, the written data are only updated in the cache.
This is the write-back policy.
(d) Cache data inconsistency appears in a write-back cache when an I/O master
writes data into the memory block which is cached.

Answer: (a), (b), (c), (d)

14. Which of the following affects the CPI (clock per instruction)?
(a) Cache structure
(b) Memory data bus width
(c) Process technology
(d) Clock cycle time

Answer: (a), (b), (c)

15. Which of the following is (are) true?


(a) A C compiler compiles a C program into assembly language program for the
target machine.
(b) Pseudoinstructions are instructions which are not implemented in hardware.
(c) A label is a pseudoinstruction
(d) Pseudoinstructions are directives in an assembly language program

Answer: (a), (b), (d)


Note: The compiler transforms the C program into an assembly language program, a
symbolic form of what the machine understands.

16. Which of the following is (are) true?
(a) In a pipeline processor, a structure hazard means that the hardware cannot
support the combination of instructions that are executed in the same clock
cycle.
(b) A structure hazard is caused by the branch instruction which is mispredicted.
(c) A structure hazard occurs if a unified cache is accessed both by the instruction
fetch and the data load at the same clock.
(d) A structure hazard is an exception which causes the processor to fetch
instruction from the exception handler.

Answer: (a), (c)

17. Which of the following is (are) true?


(a) Pipelining reduces the instruction execution latency to one cycle.
(b) Pipelining not only improves the instruction throughput but also the
instruction latency.
(c) Pipelining improves the instruction throughput rather than individual
instruction execution time.
(d) Pipelining improves the instruction throughput other than individual
instruction execution time.

Answer: (c)

18. Which of the following is (are) true?


(a) Temporal locality means the tendency to use data items that are close in
location.
(b) Temporal locality means the tendency to reuse data items that are recently
accessed.
(c) Spatial locality means the tendency to use data items that are close in location.
(d) Spatial locality means the tendency to reuse data items that are recently
accessed.

Answer: (b), (c)

19. Which of the following is (are) data transfer instructions?


(a) jal subroutine_1
(b) sw R1, 100(R2)
(c) beq R1, R2, start
(d) or R1, R2, R3

Answer: (b)

20. Which of the following instruction(s) performs NOT operation assuming R0 = 0?
(a) OR R1, R0, R3
(b) AND R1, R0, R3
(c) NOR R1, R0, R3
(d) ADD R1, R0, R3

Answer: (c)

Year 96 NCU EE

1. (a) Assume variable h is associated with register $s2 and the base address of the
array A is in $s3. Now the C assignment statement is as below:
A[12] = h + A[8]
Write the compiled MIPS assembly code by filling the blanks (A), (B), (C).
lw $t0, (A) ($s3)
add (B) , (C) , $t0
sw $t0, 48($s3)
(b) If the program is run with a machine of 50 MHz clock, and it needs to execute
the code in (a) for 10000 times. Below is the number of cycles for each class of
instruction. How many micro seconds will it take to execute this program?
Instruction Cycles
Arithmetic 1
Data transfer 3
Jump 2
(c) If the machine in (b) is a 4-way VLIW machine, what is the MOPS (million
operations per second) of this machine?

Answer:
(a)
(A) (B) (C)
32 $t0 $s2
(b) (3 + 1 + 3) × 10000 × 0.02 µs = 1400 µs
(c) Since the VLIW machine is 4-way, the data-transfer instruction can complete in
one clock cycle, during which its 3 operations are done. The three instructions
(7 operations in total) finish in 3 cycles, so at the 50 MHz clock
MOPS = 7 × (50/3) ≈ 116.67
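Both numbers can be recomputed directly (a sketch assuming the 50 MHz clock and the cycle counts given in the problem; note that 7 operations every 3 cycles at 50 MHz works out to about 116.67 MOPS):

```python
# Part (b): one pass is lw (3) + add (1) + sw (3) = 7 cycles, at 50 MHz.
cycles = (3 + 1 + 3) * 10000
time_us = cycles / 50e6 * 1e6        # 1400 microseconds

# Part (c): 4-way VLIW -- the 7 operations issue in 3 cycles.
mops = 7 * (50 / 3)                  # millions of operations per second
print(time_us, mops)
```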

2. (a) A PC has 4 MB of RAM beginning at address 00000000H. Calculate the very


last address (in hex) of this 4 MB block.
(b) If the starting address and the ending address of the ROM block are 008000H
and 010000H, calculate the size of the ROM in K.

Answer:
(a) 003FFFFFH
(b) 010000H – 008000H + 1 = 8000H + 1 = 1000 0000 0000 0000₂ + 1
= 2¹⁵ + 1 = (32K + 1) bytes

Note (b): this problem is taken from Stallings, "Computer Organization and
Architecture"; the answer provided by that textbook is 32K bytes.
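The hex arithmetic is easy to confirm; the 32K vs. 32K + 1 discrepancy hinges on whether the ending address 010000H is counted as part of the ROM (inclusive) or as one past its end:

```python
# Part (a): last address of a 4 MB RAM block starting at 0x00000000.
last = 0x00000000 + 4 * 2**20 - 1          # 0x003FFFFF

# Part (b): ROM size from 008000H to 010000H.
size_inclusive = 0x010000 - 0x008000 + 1   # both endpoints counted: 32K + 1
size_exclusive = 0x010000 - 0x008000       # 010000H as one-past-the-end: 32K
print(hex(last), size_inclusive, size_exclusive)
```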

3. Please draw the block diagram to build a simple MIPS datapath, only using the
following four components: "ALU", "Sign extend", "Data memory", "Registers".
[Figure: the four components. ALU: two inputs, a 4-bit ALU control, outputs Zero
and ALU result. Sign extend: 16-bit input, 32-bit output. Data memory: Address,
Write data, Read data, controls MemRead and MemWrite. Registers: two 5-bit
read-register numbers, a 5-bit write-register number, Write data, outputs Read
data 1 and Read data 2, control RegWrite.]

Answer:
[Figure: load/store datapath. Instruction fields select Read register 1 and Read
register 2. Read data 1 feeds one ALU input; the other ALU input is Read data 2
or the sign-extended 16-bit immediate. The ALU result supplies the Data memory
Address; Read data 2 drives Write data for stores (MemWrite); Data memory Read
data (MemRead) is written back to the register file Write data port (RegWrite).]

4. (a) Use the block diagram of 1-bit full adder as a basic block to construct a 32-bit
ripple adder (S = A + B).
(b) Add some logic blocks to the design of ripple adder so that it can do 2's
complement subtraction (S = A − B).
(c) Using 4-bit carry-lookahead blocks to form a 16-bit carry- lookahead adder.
Draw the block diagram and write down the corresponding logic equations.
(d) Compare the number of "gate delays" for the critical paths of two 16-bit adders,
one use ripple carry and one using two-level carry lookahead.

Answer:
(a)
a31 b31 a2 b2 a1 b1 a0 b0

c32 + + + + c0

s31 s2 s1 s0

(b) S = A + B when M = 0, S = A − B when M = 1
[Figure: the ripple adder of (a) with each bᵢ passed through an XOR gate
controlled by M before entering the adder; M also drives the carry-in c0, so
when M = 1 the adder computes A + B̄ + 1 = A − B in 2's complement.]
(c)
Propagate (of each 4-bit adder block):
P0 = p3·p2·p1·p0
P1 = p7·p6·p5·p4
P2 = p11·p10·p9·p8
P3 = p15·p14·p13·p12
Generate (of each 4-bit adder block):
G0 = g3 + (p3·g2) + (p3·p2·g1) + (p3·p2·p1·g0)
G1 = g7 + (p7·g6) + (p7·p6·g5) + (p7·p6·p5·g4)
G2 = g11 + (p11·g10) + (p11·p10·g9) + (p11·p10·p9·g8)
G3 = g15 + (p15·g14) + (p15·p14·g13) + (p15·p14·p13·g12)
Carries between the 4-bit adder blocks (Ci):
C1 = G0 + c0·P0
C2 = G1 + G0·P1 + c0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + c0·P0·P1·P2
C4 = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 + c0·P0·P1·P2·P3
[Figure: 16-bit carry-lookahead adder built from four 4-bit ALU blocks
(ALU0–ALU3, producing Result0–3 through Result12–15). Each block outputs its Pi
and Gi to a carry-lookahead unit, which computes C1–C4 from CarryIn and the
equations above.]
(d)
(1) In the ripple-carry adder, each bit position adds 2 gate delays to the carry
chain, so the critical path of a 16-bit ripple-carry adder is 2 × 16 = 32 gate
delays.
(2) In the carry-lookahead adder, producing gi and pi takes 1 gate delay. The
first-level carries (from pi and gi) take 2 gate delays, and the second-level
carries (from Pi and Gi) take another 2 gate delays; Pi and Gi can be computed
in parallel with the first-level carries. The critical path is therefore
1 + 2 + 2 = 5 gate delays.

5. (a) Figure 5.1 shows the partial finite state machine with control line settings to
control the datapath in Figure 5.2. Figure 5.2 below shows the MIPS
multicycle datapath with exception handling. Please fill in the names and
values of the control lines that need to be changed in the empty states A to E
such that the finite state machine can control the datapath correctly.
(b) Assume only exception, arithmetic overflow, can occur in this MIPS CPU.
Please redraw the finite state machine to handle this exception using the
datapath shown in Figure 5.2.

[Figure: multicycle control FSM fragment with empty states A, B, C, D, E]

Figure 5.1

Figure 5.2

Answer:
(a)
State A (memory address computation): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State B (R-type execution):           ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State C (branch completion):          ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01,
                                      PCWriteCond, PCSource = 01
State D (R-type completion):          RegDst = 1, RegWrite, MemtoReg = 0
State E (jump completion):            PCWrite, PCSource = 10

(b)

6. (a) While executing the MIPS code shown below, what is the target address of the
branch instruction if it is taken? (Assume the starting address of this code
segment is 28dec.)
lw $4, 50($7)
beq $1, $4, 3
add $5, $3, $4
sub $6, $4, $3
or $7, $5, $2
slt $8, $5, $6
(b) Assume this code is executed on a MIPS CPU with 5 pipeline stages and data
forwarding capability. If this CPU uses "always assume branch not taken"
strategy to handle branch instruction but the branch is taken in this example,
how many clock cycles are required to complete this program? Please explain
your answers in detail.

Answer:
(a) The branch target address will be the address of the instruction slt; therefore,
the target address = 28 + 5 × 4 = 48 (decimal).

(b) Suppose that the branch decision is made at ID stage. Since the branch is taken,
3 instructions should be executed for this code sequence. 2 clocks should be
stalled between lw and beq instructions since the branch decision is made at ID
stage. Besides, 1 instruction (add) will be flushed. Therefore, the total cycles
to complete the code sequence = (5 – 1) + 3 + 2 + 1 = 10 clock cycles.

7. (a) Please briefly explain the relationship between virtual memory, TLBs, and
caches in the memory system of modern computers.
(b) Assume there are two small caches, each consisting of six one-word blocks.
One cache is direct mapped, and the other cache is two-way set associative.
Please find the number of misses of each cache organization given the
following sequence of block address: 0, 15, 12, 3, 15, 0. Besides the number of
misses, please also explain your answers in detail.

Answer:
(a) The TLB contains a subset of the virtual-to-physical page mappings that are in
the page table. On every reference, we look up the virtual page number in the
TLB. Under the best of circumstances, a virtual address is translated by the
TLB and sent to the cache where the appropriate data is found, retrieved, and
sent back to the processor.
(b)
Direct mapped                           Two-way set associative
Block address  Cache block  Hit/Miss    Block address  Set no.  Hit/Miss
0 0 Miss 0 0 Miss
15 3 Miss 15 0 Miss
12 0 Miss 12 0 Miss
3 3 Miss 3 0 Miss
15 3 Miss 15 0 Miss
0 0 Miss 0 0 Miss
Number of misses 6 Number of misses 6
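The two miss counts can be reproduced with a small simulation (a sketch assuming LRU replacement in the two-way cache, the usual assumption for this exercise):

```python
def direct_mapped_misses(seq, n_blocks=6):
    """Count misses in a direct-mapped cache of n_blocks one-word blocks."""
    cache = [None] * n_blocks
    misses = 0
    for blk in seq:
        idx = blk % n_blocks
        if cache[idx] != blk:
            misses += 1
            cache[idx] = blk
    return misses

def two_way_lru_misses(seq, n_blocks=6):
    """Count misses in a two-way set-associative cache with LRU replacement."""
    n_sets = n_blocks // 2
    sets = [[] for _ in range(n_sets)]
    misses = 0
    for blk in seq:
        s = sets[blk % n_sets]
        if blk in s:
            s.remove(blk)           # hit: refresh LRU order
        else:
            misses += 1
            if len(s) == 2:
                s.pop(0)            # evict the least recently used block
        s.append(blk)
    return misses

seq = [0, 15, 12, 3, 15, 0]
print(direct_mapped_misses(seq), two_way_lru_misses(seq))
```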

Year 95 NCU EE

1. Answer the following problem briefly.


(1) The ARM processor is a RISC or a CISC machine.
(2) How much general purpose registers does the ARM processor have in
supervisor mode?
(3) The ARM processor supports only little-endian, only big-endian, or both
little-endian and big-endian memory addressing modes.
(4) List two key objectives which the USB (Universal Serial Bus) has been
designed to meet.

Answer:
(1) RISC
(2) There are 13 general-purpose registers (R0 – R12) in supervisor mode
(3) Little-endian
(4) USB was designed (i) to allow peripherals to be connected using a single
standardized interface socket; (ii) to improve plug-and-play capabilities by
allowing devices to be connected and disconnected without rebooting the
computer; (iii) to supply power to low-consumption devices without the need
for an external power supply; and (iv) to allow many devices to be used
without requiring manufacturer-specific, individual device drivers to be
installed.
Note (2): there are 15 general-purpose registers (R0 – R14) in user mode.
Note (3): the byte ordering on a MIPS chip is big-endian.

2. Design an 8-bit carry select adder using 4-bit ripple carry adders and 2-input
multiplexers.
(1) Draw the block diagram of the 8-bit carry select adder using the block
diagrams of 4-bit ripple carry adder and 2-input multiplexer shown in Fig. 1,
and explain how the 8-bit carry select adder works.
a[3:0] b[3:0]
x y

4-bit Ripple Carry Adder Cin 1 0 Sel

z
Cout s[3:0]
Figure 1: Block diagrams of 4-bit ripple carry adder and 2-input multiplexer.
(2) Assume the critical delay of a 4-bit ripple carry adder and a 2-input
multiplexer is 4ns and 0.2ns, respectively. Calculate the critical delay of the
8-bit carry select adder.

Answer:
(1)
[Figure: 8-bit carry select adder. The low nibble a[3:0], b[3:0] goes through
one 4-bit ripple carry adder with Cin = 0, producing s[3:0] and a carry-out c.
The high nibble a[7:4], b[7:4] goes through two 4-bit ripple carry adders in
parallel, one with Cin = 1 (adder 1) and one with Cin = 0 (adder 0). The low
adder's carry-out c drives 2-input multiplexers that select the correct
high-nibble sum s[7:4] and the final carry-out.]
Both candidate high-nibble results are computed while the low nibble is still
adding, so once c is known only one multiplexer delay remains.
(2) Critical delay of the 8-bit carry select adder = 4 + 0.2 = 4.2 ns

3. A computer system has L1 and L2 caches. The local hit rates for L1 and L2 are
90% and 80%, respectively. The miss penalties are 10 and 50 cycles, respectively.
Assuming a CPI of 1.2 without any cache misses and an average of 1.1 memory
accesses per instruction:
(1) What is the effective CPI after cache misses are factored in?
(2) Taking the two levels of caches as a single cache memory, what are its miss
rate and miss penalty?

Answer:
(1) The effective CPI = 1.2 + 1.1 × (0.1 × 10 + 0.1 × 0.2 × 50) = 3.4
(2) Hit rate = 0.9 + 0.1 × 0.8 = 0.98 ⇒ Miss rate = 1 – 0.98 = 0.02 = 2%
Miss penalty = 50 cycles
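Both parts reduce to the following short computation (a minimal sketch; the rates and penalties are those given in the problem):

```python
# Two-level cache: L1 local hit 90%, L2 local hit 80%; penalties 10 and 50 cycles.
base_cpi, mem_per_instr = 1.2, 1.1
l1_miss, l2_local_miss = 0.10, 0.20

effective_cpi = base_cpi + mem_per_instr * (l1_miss * 10
                                            + l1_miss * l2_local_miss * 50)
combined_miss_rate = l1_miss * l2_local_miss   # accesses that go to main memory
print(effective_cpi, combined_miss_rate)
```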

4. Figure 2 depicts a 4-stage branch prediction scheme that corresponds to keeping 2


bits of history. As long as a branch continues to be taken, we predict that it will
be taken the next time (the "Predict taken" state in the Fig. 2). After the first
misprediction, the state is changed, but we continue to predict that the branch will
be taken. A second misprediction causes another change of state, this time to a
state that causes the opposite prediction. Now a processor runs a program that
consists of two nested loops, with a single branch instruction at the end of each
loop and no other branch instruction anywhere. Also, the outer loop is executed
10 times and the inner loop 20 times.
Determine the accuracy of the following two branch prediction strategies:
(1) always predict taken,
(2) use the branch prediction scheme shown in Fig. 2.

(Hint: Accuracy is defined as the ratio of the number of correct predictions to
the total number of branch predictions.)

Figure 2: A 4-stage branch prediction scheme.

Answer:
The total number of branch instructions executed in the inner loop is 20 × 10 = 200.
The total number of branch instructions executed in the outer loop is 10.
(1) The inner branch is mispredicted 10 times and the outer branch 1 time.
    Hence accuracy = (210 – 11)/210 = 0.9476
(2) If the 2-bit prediction scheme is initialized in the predict-taken state,
    the inner branch is again mispredicted 10 times and the outer branch 1 time.
    Hence accuracy = (210 – 11)/210 = 0.9476
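Since the two branches keep separate predictor state, the counts can be simulated (a sketch; the 2-bit counter starts in the strongly-taken state, matching the solution's assumption):

```python
def correct_two_bit(outcomes, state=3):
    """2-bit saturating counter; states 2-3 predict taken, 0-1 predict not taken."""
    correct = 0
    for taken in outcomes:
        correct += (state >= 2) == taken
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

inner = ([True] * 19 + [False]) * 10   # inner branch: taken 19x, then falls out
outer = [True] * 9 + [False]           # outer branch

total = len(inner) + len(outer)        # 210 branch executions
always_taken = sum(inner) + sum(outer) # correct count for "always predict taken"
two_bit = correct_two_bit(inner) + correct_two_bit(outer)
print(always_taken / total, two_bit / total)
```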

5. An example of MIPS machine assembly language notation, op a, b, c means an


instruction with an operation op on the two variables b and c, and to put their
result in a. Now a C language statement is as:
f = (g + h) − (i + j);
The variables f, g, h, i, and j can be assigned to the registers $s0, $s1, $s2, $s3,
and $s4, respectively. Now we use two temporary registers $t0 and $t1 to write
the compiled MIPS assembly code as follows:
add $t0, $s1, (A)
(C) $t1, $s3, (B)
sub $s0, (D), (E)
Please fill the results on the blank (A), (B), (C), (D), and (E).

Answer:
(A) (B) (C) (D) (E)
$s2 $s4 add $t0 $t1

6. Please answer the following questions:
(1) Discuss the differences between "RISC" and "CISC" machine.
(2) A performance metric on processor is called "MOPS". What is MOPS? If a
machine has the same metric on MIPS and MOPS, what does it mean?

Answer:
(1)
RISC:
- All instructions have the same length (uniform length makes fetch faster)
- Only a few addressing modes
- Only a few instruction formats (fewer formats make decode easier)
- Operands of arithmetic instructions may come only from registers
- Data in memory must be loaded into registers before it can be processed
CISC:
- Instruction lengths vary
- Many addressing modes are supported
- Many instruction formats are supported
- Operands of arithmetic instructions may come from memory
- Data in memory can be operated on directly, without an explicit load/store
(2) MOPS means Millions of Operations Per Second.
If a machine has the same metric on MIPS and MOPS, it means each
instruction performs exactly one operation.

7. Consider two different implementations, M1 and M2, of the same instruction set.
There are four classes of instruction (A, B, C, and D) in the instruction set. M1
has a clock rate of 500 MHz and M2 has a clock rate of 750 MHz.
Instruction class CPI (MachineM1) CPI (Machine M2)
A 1 2
B 2 2
C 3 4
D 4 4
(Hint: CPI means clock cycles per instruction)
(1) Assume the peak performance is defined as the fastest rate that a machine can
execute an instruction sequence chosen to maximum that rate. What are the
peak performances of M1 and M2? Please express as instructions per second?
(2) If the number of instructions executed in a certain program is divided equally
among the classes of instructions. How much faster is M2 than M1?

Answer:
(1) Peak performance of M1 = (500 × 10⁶)/1 = 500 × 10⁶ instructions per second
    Peak performance of M2 = (750 × 10⁶)/2 = 375 × 10⁶ instructions per second
(2) CPI for M1 = (1 + 2 + 3 + 4)/4 = 2.5

CPI for M2 = (2 + 2 + 4 + 4)/4 = 3
Suppose the instruction count the program = IC
Execution for M1 = (2.5  IC)/500  106 = 5  IC (ns)
Execution for M2 = (3  IC)/750  106 = 4  IC (ns)
M2 is faster than M1 by 5/4 = 1.25 times

96 年中央資工 (NCU CSIE, 2007)

1. There is an unpipelined processor that has a 1 ns clock cycle and that uses 4
cycles for ALU and branch operations and 5 cycles for memory operations.
Assume that the relative frequencies of these operations are 40%, 20%, and 40%,
respectively. Suppose that due to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much
speedup in the instruction execution rate will we gain from a pipeline
implementation?

Answer:
Average instruction execution time
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 4.4 ns
Speedup from pipelining
= Average instruction time unpipelined/Average instruction time pipelined
= 4.4 ns/1.2 ns ≈ 3.7

2. A computer system has L1 and L2 caches. The local hit rates for L1 and L2 are
95% and 80%, respectively. The miss penalties are 8 and 60 cycles, respectively.
(a) Assume a CPI (Cycles per Instruction) of 1.2 without any cache miss and an
average of 1.1 memory accesses per instruction, what is effective CPI after cache
misses are factored in? (b) Taking the two levels of caches as a single cache
memory, what are its miss rate and miss penalty?

Answer:
(a) Effective CPI = 1.2 + 1.1 × [(1 - 0.95) × 8 + (1 - 0.95) × (1 - 0.8) × 60] = 2.3
(b) Hit rate = 95% + 5% × 80% = 99%, so miss rate = 1 - 99% = 1%
(or miss rate = 0.05 × 0.2 = 0.01)
The miss penalty of the combined cache is 60 cycles.

3. Engineers in your company developed two different hardware implementations,
M1 and M2, of the same instruction set, which has three classes of instructions: I
(Integer arithmetic), F (Floating-point arithmetic), and N (Non-arithmetic). M1's
clock rate is 1.2GHz and M2's clock cycle time is 1ns. The average CPI for the
three instruction classes on M1 and M2 are shown below:
Class CPI for M1 CPI for M2
I 3.2 3.8
F 5.6 4.2
N 2.4 2.0
Please answer the following questions:
a. What are the peak performances of M1 and M2 in MIPS?

b. If 50% of all instructions executed in a program are from class N and the rest
are divided equally among F and I, which machine is faster and by what
factor?
c. The designers of M1 plan to redesign the machine to improve its
performance.
With the instruction mix given in question b, please evaluate each of the
following 4 redesign options and rank them according to their performance
improvement.
1. Using a faster floating-point unit which doubles the speed of
floating-point arithmetic execution.
2. Adding a second integer ALU to reduce the integer CPI to 1.6.
3. Using faster logic that allows a clock rate of 1.5GHz with the same CPIs.
4. The CPIs given in the table include the effect of instruction cache misses
at an average rate of 5%. Each cache miss adds 10 cycles to the effective
CPI of the instruction causing the miss. A new redesign option is to use a
larger instruction cache that would reduce the miss rate from 5% to 2%.
d. If you prefer the M2 implementation and would like to work out test
programs that run faster on M2 than on M1. Let x and y be the fraction of
instructions belonging to class I and F respectively. What kind of relationship
between x and y will you maintain?

Answer:
a. The ideal instruction sequence for both machines is one composed entirely of
instructions from class N. So M1's peak performance is (1.2 × 10^9)/2.4 = 500
MIPS, and M2's peak performance is (1 × 10^9)/2.0 = 500 MIPS.
b. The average CPI of M1 = 0.5 × 2.4 + 0.25 × 3.2 + 0.25 × 5.6 = 3.4
The average CPI of M2 = 0.5 × 2.0 + 0.25 × 3.8 + 0.25 × 4.2 = 3
Time per instruction on M1 = 3.4/(1.2 × 10^9) ≈ 2.83 ns; on M2 = 3/(1 × 10^9) = 3 ns.
M1 is therefore 3/2.83 ≈ 1.06 times faster than M2.
c. We compare the instruction execution time for each design.
CPI1 = 0.5 × 2.4 + 0.25 × 3.2 + 0.25 × 5.6 × 0.5 = 2.7
Instruction execution time for design 1 = 2.7/1.2 GHz = 2.25 ns
CPI2 = 0.5 × 2.4 + 0.25 × 1.6 + 0.25 × 5.6 = 3
Instruction execution time for design 2 = 3/1.2 GHz = 2.5 ns
Instruction execution time for design 3 = 3.4/1.5 GHz ≈ 2.27 ns
Effective CPI = 3.4 = CPIbase + 10 × 0.05 ⇒ CPIbase = 2.9
New CPI = 2.9 + 10 × 0.02 = 3.1
Instruction execution time for design 4 = 3.1/1.2 GHz ≈ 2.58 ns
Hence, the relative performance is
Design 1 > Design 3 > Design 2 > Design 4

(d) M2 faster than M1 ⇒ instruction time on M2 < instruction time on M1 ⇒
(3.8x + 4.2y + 2 × (1 - x - y))/(1 × 10^9) < (3.2x + 5.6y + 2.4 × (1 - x - y))/(1.2 × 10^9)
⇒ x/y < 0.41

4. Please answer the following questions about memory hierarchy.


a. Please write short C codes to demonstrate the locality of memory access.
b. What are TLB and page table? Please describe clearly and systematically how
a memory access is completed by the processor cache/main
memory/TLB/page table/hard disk.
c. A computer system has a cache memory with 128K bytes. The 32-bit memory
address format is as follows: Tag bits: 31~15, Index (or Set) bits: 14~4, Offset
bits:3~0. Please derive the number of degrees of set associativity in this
cache.

Answer:
a. int list[100];
   int i;
   for (i = 0; i < 100; i++)   /* sequential accesses to list[] exhibit spatial */
       list[i] = i;            /* locality; the reuse of i exhibits temporal locality */
b. TLB: A cache that keeps track of recently used address mappings to avoid an
access to the page table.
Page table: The table containing the virtual to physical address translations in
a virtual memory system.
A virtual address issued from CPU is translated by the TLB. When a TLB
miss occurs, the entry of the mapping will move from page table to TLB. If
page table can not find this mapping, then a page fault occurs. Operating
System moves the missing page from hard disk to the physical memory and
the mapping thus will exist in the page table. We use the translated physical
address to search data/instruction in the cache. If cache hit, data/instruction
will send to CPU; otherwise, cache miss occurs. A separate control will move
the missing block from memory to cache and then send to CPU.
c. Offset = 4 bits ⇒ block size = 2^4 = 16 bytes
The number of blocks in the cache = 128 KB/16 B = 8K
The index field has 11 bits ⇒ the cache has 2K sets
The number of blocks in a set = 8K/2K = 4
Hence the degree of set associativity is 4.

95 年中央資工 (NCU CSIE, 2006)

1. Two machines, M1 and M2, run the same instruction set. The
instruction set is composed of 3 classes of instructions (A, B, and C). M1 runs
at 100 MHz, and M2 runs at 250 MHz. The average number of cycles per
instruction for each implementation is as follows:
Instruction Class CPI of M1 CPI of M2
A 1 2
B 1 2
C 3 2
(1) Define the peak performance as the fastest rate at which a machine could
execute an instruction sequence chosen to maximize the rate. What are the
peak performances of M1 and M2 in instructions per second?
(2) If a benchmark program consists of 30%, 30%, and 40% of all instructions for
class A, B, and C, respectively. Which machine will execute the program
faster, and by how much?

Answer:
(1) Peak performance for M1 = 100M/1 = 100M instructions per second
Peak performance for M2 = 250M/2 = 125M instructions per second
(2) The average execution time of an instruction for M1
= (0.3 × 1 + 0.3 × 1 + 0.4 × 3)/100M = 18 ns
The average execution time of an instruction for M2
= (0.3 × 2 + 0.3 × 2 + 0.4 × 2)/250M = 8 ns
M2 will execute the program faster, by 18/8 = 2.25 times

2. Consider a 6-stage pipeline, if it needs two clock cycles in stage 3 and three
clock cycles in stage 4, while each of the other stages only needs one clock
cycle.
(1) Please draw a figure to indicate the execution flow of 5 instructions for the
above pipeline machine in ideal case (without any hazard).
(2) Please specify the number of clock cycles that is required to execute n
instructions for the above pipeline machine in ideal case (without any hazard).
(3) Please use figure to explain how the superscalar technique can be applied to
improve the performance of the above pipeline machine?

Answer:

(1) Assuming the two-cycle stage S3 and the three-cycle stage S4 are internally
pipelined so that a new instruction can enter every cycle, the five instructions
flow through the stages as follows (rows are instructions, columns are cycles):

      c1  c2  c3  c4  c5  c6  c7  c8  c9  c10 c11 c12 c13
I1    S1  S2  S3  S3  S4  S4  S4  S5  S6
I2        S1  S2  S3  S3  S4  S4  S4  S5  S6
I3            S1  S2  S3  S3  S4  S4  S4  S5  S6
I4                S1  S2  S3  S3  S4  S4  S4  S5  S6
I5                    S1  S2  S3  S3  S4  S4  S4  S5  S6
(2) The first instruction completes after (6 - 1) + 1 + 2 + 1 = 9 cycles (six stages
plus one extra cycle in S3 and two extra in S4), and each later instruction
completes one cycle after its predecessor, so the total number of clock cycles
= ((6 - 1) + 1 + 2) + n = 8 + n
(3) See the figure below; suppose a 2-issue superscalar version of (1) is used.
Only 11 clock cycles are needed, rather than 13, to execute these instructions.
Superscalar execution allows more than one instruction in each stage at a time,
hence the performance is increased.
        c1  c2  c3  c4  c5  c6  c7  c8  c9  c10 c11
I1,I2   S1  S2  S3  S3  S4  S4  S4  S5  S6
I3,I4       S1  S2  S3  S3  S4  S4  S4  S5  S6
I5              S1  S2  S3  S3  S4  S4  S4  S5  S6

3. You are designing a memory system similar to the one shown in the following
figure: Your memory system design uses 16KB pages, 40-bit virtual byte
address and 32-bit physical address. The TLB contains 16 entries. The cache
has 4K blocks with 4-word block size (1 word has 4 bytes) and is 4-way set
associative.
[Figure: TLB and cache organization. A virtual address is split into a virtual page
number and a page offset. The TLB (with valid, dirty, tag, and physical page
number fields) translates the virtual page number into a physical page number,
which is concatenated with the page offset to form the physical address. The
physical address is in turn split into a physical address tag, cache index, block
offset, and byte offset to access the set-associative cache, whose tag comparison
produces the cache hit signal and selects the data word.]

(1) What is the total size of the physical page number in the page table for each
process on this processor? (Assuming that all the virtual pages are in use.)
(2) What is the total number of tag bits for the cache?
(3) What kind of associativity (direct-mapped, full-associative or set-associative)
will you choose for this TLB and why?
(4) Assume that you choose 2-way set associativity for TLB and that each block
has 4 words and the initial TLB is empty. After a series of address references
given as word addresses: 1, 4, 8, 5, 17, 32, 19, 1, 56, 9, 25. Please label each
reference in the list as hit or miss. (Note: the first word address of each block
is a multiple of 4 and LRU is used.)
(5) A memory reference in this system may encounter three types of misses: a
TLB miss, a page fault and cache miss. Consider all the 8 combinations of the
three events (with hit/miss) and identify/explain which cases are impossible.
(6) To improve the cache performance, you are allowed to change  cache size
 associativity  block size. Please describe their positive and negative
affects on the performance.

Answer:
(1) The size of a physical page number = 32 - log2(16K) = 32 - 14 = 18 bits
The number of page table entries = 2^40/16K = 2^26
The total size of the physical page numbers in the page table = 2^26 × 18 bits
(2) The size of a tag = 32 - log2(4K/4) - log2(16) = 32 - 10 - 4 = 18 bits
The total number of tag bits = 2^10 × 18 × 4 = 72K bits
(3) Since this TLB has only 16 entries, fully associative mapping is the most
appropriate: the comparator cost is small and the miss rate is minimized.
(4) The size of the offset field = 2 bits; the size of the index field = log2(16/2) = 3 bits

Word address
Decimal  Binary   Tag  Index  Hit/Miss
1        000001   0    0      Miss
4        000100   0    1      Miss
8        001000   0    2      Miss
5        000101   0    1      Hit
17       010001   0    4      Miss
32       100000   1    0      Miss
19       010011   0    4      Hit
1        000001   0    0      Hit (still resident, since the set is 2-way)
56       111000   1    6      Miss
9        001001   0    2      Hit
25       011001   0    6      Miss
(5)
TLB   Page table  Cache  Identify/Explain
hit   hit         hit    Possible: the normal case.
hit   hit         miss   Possible, although the page table is never really
                         checked if the TLB hits.
miss  hit         hit    TLB misses, but the entry is found in the page table;
                         after retry, the data is found in the cache.
miss  hit         miss   TLB misses, but the entry is found in the page table;
                         after retry, the data misses in the cache.
miss  miss        miss   TLB miss is followed by a page fault; after retry, the
                         data must miss in the cache.
hit   miss        miss   Impossible: cannot have a translation in the TLB if
                         the page is not present in memory.
hit   miss        hit    Impossible: cannot have a translation in the TLB if
                         the page is not present in memory.
miss  miss        hit    Impossible: data cannot be allowed in the cache if
                         the page is not in memory.

(6)
Change                Positive effects                 Negative effects
Larger cache size     Reduces capacity misses          Increases hit time
Higher associativity  Reduces conflict misses          Increases hit time
Larger block size     May reduce the miss rate         Increases the miss penalty;
                      (spatial locality)               a block size that is too
                                                       large also increases the
                                                       miss rate

94 年中央資工 (NCU CSIE, 2005)

1. True or False. (If the statement is false, explain the answer shortly)
(1) If we write a 32-bit (4-byte) data word, 0x12345678, to the address 0x2000 in
a big-endian system, then the byte stored in 0x2000 is 0x78.
(2) A write-through cache will have the same miss rate as a write-back cache.
(3) The case of "TLB miss, Page Table miss, Cache hit" is possible.
(4) In memory hierarchy design, increasing the block size will help to decrease
the miss penalty.
(5) Conflict misses will not happen in fully associate caches.

Answer:
(1) False (the correct answer is 0x12)
(2) True
(3) False (the page is not in memory, so its data cannot be in the cache either)
(4) False (increasing the block size increases the time to transfer a block between
memory and cache, and thus increases the miss penalty)
(5) True

2. Please compare "write-through" and "write-back" in cache system.

Answer:
Policy        Write-through                    Write-back
Scheme        Writes always update both the    Writes update only the block in
              cache and the memory, ensuring   the cache; the modified block is
              that the data is always          written to memory only when the
              consistent between the two.      block is replaced.
Advantage     Easy to implement                A block is written to memory only
                                               when it is replaced, so CPU
                                               writes are faster
Disadvantage  Every write to the cache also    Harder to implement
              writes to memory, so CPU
              writes are slower

3. Use Booth's algorithm to compute 5 × (-3) (4-bit numbers) = -15 (8-bit product).
Complete the following table
Iteration Step Product
0 Initial step
(No) operation
1
Shift
(No) operation
2
Shift
(No) operation
3
Shift
(No) operation
4
Shift

Answer:
Iteration  Step                         Product
0          Initial values               0000 1101 0
1          10 ⇒ Prod = Prod - Mcand     1011 1101 0
           Shift right                  1101 1110 1
2          01 ⇒ Prod = Prod + Mcand     0010 1110 1
           Shift right                  0001 0111 0
3          10 ⇒ Prod = Prod - Mcand     1100 0111 0
           Shift right                  1110 0011 1
4          11 ⇒ No operation            1110 0011 1
           Shift right                  1111 0001 1

4. (1) If the execution time of each instruction is t. How long does it take to execute n
instructions in an ideal 6-stage pipeline machine (assuming pipeline hazard and
overhead are ignored)?
(2) In which condition the execution sequence of instructions may be out of order
in a pipeline system?

Answer:
(1) Each stage takes t/6, so executing n instructions takes
(t/6) × ((6 - 1) + n) = (5 + n) × t/6
(2) With hardware support for dynamic pipeline scheduling, instructions may be
executed out of order.

5. Please describe the criteria which will affect the encoding and length of
instruction set and the design considerations of a RISC processor.

Answer:

Single-cycle operation: all operations complete within one clock cycle, which
simplifies the hardware design and speeds up instruction execution.
Load/store design: only registers may serve as ALU operands; operands in
memory must be accessed with explicit load/store instructions. Because
registers are faster than memory, this raises the efficiency of instruction
execution.
Hardwired control: compared with a microprogrammed control unit, hardwired
control gives a shorter instruction-decode time.
Relatively few instructions and addressing modes: speeds up the fetching of
instructions and operands.
Fixed instruction format: speeds up instruction decoding.

93 年中央資工 (NCU CSIE, 2004)

1. Explain the following terms in detail, or answer the following questions:
(1) Associative cache organization
(2) Nanoprogramming
(3) Horizontal and vertical microinstruction
(4) Describe the basic idea of Booth's multiplier and write down the conversion
table.
(5) Two-bit dynamic branch prediction.
(6) What is the carry-save adder (CSA)? Give the structure of adding 4 numbers
by CSA.

Answer:
(1) A compromise between a direct-mapped cache and a fully associative cache,
where each address is mapped to a certain set of cache locations. For example,
in an "n-way set associative" cache with S sets and n cache locations in each
set, block b is mapped to set "b mod S" and may be stored in any of the n
locations.
(2) A combination of vertical and horizontal microinstructions in a two-level
scheme is called nanoprogramming. Many microinstructions occur several
times through the micro program. In this case, the distinct microinstructions
are placed in a small control storage. The nanostore then contains the index in
the microcontrol store of the appropriate microinstruction.
(3) A vertical microinstruction is highly encoded and looks like a simple
macroinstruction; it might contain a single opcode field and one or two
operand specifiers.
A horizontal microinstruction might be completely unencoded and each
control signal may be assigned to a separate bit position in the
microinstruction format.
(4) Booth's algorithm performs a subtraction when it encounters the beginning of
a block of ones (bit pair 10, scanning from the least significant bit) and an
addition when it passes the end of the block (bit pair 01). This works for a
negative multiplier as well. When the ones in a multiplier are grouped into
long blocks, Booth's algorithm performs fewer additions and subtractions than
the normal multiplication algorithm. The following shows the conversion
table.
ai ai-1 Operation
0 0 Do nothing
0 1 Add b
1 0 Subtract b
1 1 Do nothing
(5) A branch prediction scheme. A prediction must be wrong twice before it is
changed.
(6) A carry-save adder (CSA) is just a set of one-bit full adders, without any
carry chaining. Therefore, an n-bit CSA receives three n-bit operands, namely
A(n-1)..A(0), B(n-1)..B(0), and CIN(n-1)..CIN(0), and generates two n-bit
result values, SUM(n-1)..SUM(0) and COUT(n-1)..COUT(0). To add four
numbers, a first CSA reduces three of them to a sum vector and a carry vector;
a second CSA combines these two vectors with the fourth number, and a final
carry-propagate adder adds the last sum and carry vectors.

2. Compare the instruction-set architectures in RISC and CISC processors in terms
of at least 5 important characteristics.

Answer:
RISC (reduced instruction set computer)    CISC (complex instruction set computer)
All instructions are the same size         Instructions are not the same size
(32 bits on the MIPS)
Few addressing modes are supported         Many addressing modes are supported
Only a few instruction formats             Many instruction formats are supported
(makes decoding easier)
Arithmetic instructions can only work      Arithmetic instructions can work on
on registers                               memory operands
Data in memory must be loaded into         Data in memory can be processed
registers before processing                directly, without load/store
                                           instructions

3. Explain the basic idea (giving the key points and the reasons) of the two major
division algorithms:
(1) restoring and
(2) nonrestoring divisions.

Answer:
(1) Restoring division: keep subtracting the divisor from the partial remainder; if
the result goes negative, restore it (add the divisor back), record a 0 quotient
bit, shift one place, and continue. In restoring division the partial remainder is
always positive (or zero).
(2) Nonrestoring division: if the partial remainder goes negative, do not restore it;
shift one place and add the divisor in the next iteration instead of subtracting
(and subtract again once the remainder turns positive). In nonrestoring
division the partial remainder may be positive or negative, which saves the
restoring additions.

92 年中央資工 (NCU CSIE, 2003)

1. Explain the following terms in detail, or answer the following questions:
(a) Write down the IEEE 754 representation (in hex format) for the value
"-13.125".
(b) What is the advantage of two‘s complement representation when compared
with signed-magnitude representation?
(c) Given an n-bit 2's complement representation (Xn-1 Xn-2 … X1 X0). What's the
value that it represents (write down the equation but NOT any specific
example)?
(d) Write down the Boolean equation of testing overflow in n-bit 2‘s complement
addition.
(e) What is CPI?

Answer:
(a) -13.125₁₀ = -1101.001₂ = -1.101001₂ × 2³
sign = 1, exponent = 3 + 127 = 130 = 10000010₂
⇒ 1 10000010 10100100000000000000000 = C1520000₁₆
(b) Signed-magnitude has the following drawbacks:
(1) Zero has two representations (+0 and -0), which can easily lead to
program errors for a careless programmer.
(2) Addition requires an extra step to determine whether the result is
positive or negative.
(3) The sign bit could be placed at either the leftmost or the rightmost
position; how should one decide?
(c) Value = -Xn-1 × 2^(n-1) + Σ (i = 0 to n-2) Xi × 2^i
(d) Overflow = cn ⊕ cn-1, where cn-1 and cn are the carry-in and carry-out bits of
the (n - 1)th bit, respectively.
(e) CPI (clock cycles per instruction): the average number of CPU clock cycles
required to execute one instruction.

2. About nanoprogramming technique:


(a) Sketch and explain the block diagram of nanoprogramming.
(b) Explain the reason why people use nanoprogramming for CISC processor
control design.

Answer:
(a) If the microstore is wide, and has lots of the same words, then we can save
microstore memory by placing one copy of each unique microword in a
nanostore, and then use the microstore to index into the nanostore. Figure 1a
illustrates the space requirement for the original microstore ROM. There are n
= 2048 words that are each 41 bits wide. Suppose now that there are 100
unique microwords in the ROM. Figure 1b illustrates a configuration that uses
a nanostore, in which an area savings can be realized if there are a number of
bit patterns that recur in the original microcode sequence. The unique

microwords (100 for this case) form a nanoprogram, which is stored in a
ROM that is 100 words deep by 41 bits wide. The microprogram now indexes
into the nanostore.
[Fig. 1a: the original microprogram store, n = 2048 words, each w = 41 bits wide.
Fig. 1b: the microstore reduced to 2048 words of 7 bits, each entry indexing a
nanostore of m = 100 nanowords, each 41 bits wide.]
(b) A CISC computer has a large number of instructions and formats, so its
control circuitry is complex to design. With horizontal encoding the control
would need a huge microstore; with vertical encoding it would be too slow.
CISC designs therefore generally use nanoprogramming, which mixes
vertical and horizontal encoding.

3. For the hierarchical ―carry-lookahead‖ adder design, please write down the
Boolean equations for the following signals.
(a) 16-bit ―Group Propagate‖ based on 4-bit Propagate Pi and 4-bit Generate Gi (i
= 0, 1, 2, 3).
(b) 16-bit ―Group Generate‖ based on 4-bit Propagate Pi and 4-bit Generate Gi
(i = 0, 1, 2, 3).

Answer:
(a) 16-bit Group Propagate = P3·P2·P1·P0
(b) 16-bit Group Generate = G3 + (P3·G2) + (P3·P2·G1) + (P3·P2·P1·G0)

4. Assume there is a 4 KB cache with set-associative address mapping, and the
cache is partitioned into 32 sets with 4 blocks in each set. The memory-address
size is 23 bits, and the smallest addressable unit is byte.
(a) To what set of the cache is the address 000010AF16 assigned?
(b) If the addresses 000010AF16 and FFFF7xyz16 can be as assigned to the same
cache set, what values can the address digits xyz have?

Answer:
(a) Block size = 4 KB/(32 × 4) = 32 bytes
Byte address = 000010AF₁₆ = 000 0000 0001 0000 1010 1111₂ (23 bits) ⇒
block address = 000 0000 0001 0000 101₂
block address mod 32 = 00101₂ ⇒ set number = 5
(b) Address 000010AF₁₆ maps to set 5, so bits 9 to 5 of FFFF7xyz₁₆ must be 00101.
Therefore xyz₁₆ = ××00101×××××₂ (×: either 0 or 1)

5. (a) What kinds of hazard may occur on the pipeline architecture?
(b) Consider a pipeline machine with 5 stages (instruction fetch, instruction decode,
execution, memory access, and write back) and load-store instruction set. What
hazards will occur in the following program (please indicate the numbers of the
instructions)? And how to solve it?
(1) Load R1, 3(R2) ;R1 ← Mem(R2 + 3)
(2) Add R3, R2, R7 ;R3 ← R2 + R7
(3) Store 0(R4), R3 ;Mem(R4) ← R3
(4) Sub R2, R1, R5 ;R2 ← R1 - R5
(5) Load R6, 4(R3) ;R6 ← Mem(R3 + 4)
(6) Add R8, R6, R1 ;R8 ← R6 + R1
(7) OR R6, R4, R5 ;R6 ← R4 or R5
(8) Sub R3, R7, R2 ;R3 ← R7 - R2

Answer:
(a) Structural hazard, data hazard, and control hazard.
(b) Assuming a register can be written and read in the same clock cycle, data
hazards exist between instructions (2, 3) and (5, 6). The hazard between (2, 3)
can be resolved by forwarding. The load-use hazard between (5, 6) requires
one stall cycle followed by forwarding. Alternatively, the compiler can
reorder the instructions or insert NOP instructions to remove all data hazards.

6. What is daisy-chain arbitration? And what is its application in computer
systems?

Answer:
(1) Daisy chain arbitration – the grant line runs through the connected devices
from highest priority to lowest priority with priorities determined by position
of the devices. This scheme is simple, but a low priority device may be locked
out indefinitely, and the use of a daisy chain grant limits the bus speed.

[Figure: the grant line from the bus arbiter is daisy-chained from Device 1
(highest priority) through Device 2 … Device N (lowest priority); the release
and request lines are shared by all devices in wired-OR fashion.]
(2) Daisy-chain arbitration can be used to decide which bus master in a bus
system obtains the use of the bus, or, in an interrupt system, to determine the
interrupt priority of the I/O devices.

96 年中正資工 (CCU CSIE, 2007)

I. Short-answer questions: answer each question concisely; for each statement,
correct it and point out the key error.
1. In coding assembly, an integer multiplication by a power of 2 can be replaced by
a left shift, and an integer division by a power of 2 can be replaced by a logical
right shift.

Answer:
That an integer division by a power of 2 can be replaced by a logical right shift is
true only for unsigned integers.

2. For a cache, one way to improve the performance of the write-through scheme is
to use a write buffer. With write buffer, the processor does not need stall while
performing a write.

Answer:
The CPU may still need to stall: if the write buffer is full when a write occurs,
the processor must wait until the buffer has an empty entry. A write buffer
removes most, but not all, write stalls.

3. RAID 2 may recover a single-bit failure with extra 3-bits error correction
information. RAID 3 can achieve the same goal with only one extra bit
information and thus reduce the overhead. Explain briefly how RAID 3 does it.

Answer:
RAID 3 needs only one extra parity bit to hold the check information in case
there is a failure. When a bit fails, subtract all the data in the good bits from
the parity bit; the remainder must be the missing information.

II. Essay questions
1. Pipeline.
(a) Explain the law of performance. That is, how can CPU time be factored into
three terms? (Hint: CPI and cycle time) Also, briefly explain what issues may
have impact on these three terms.
(b) Instead of a traditional 5-stage pipeline processor, supposedly we have a new
design by only allowing register operands for load/store instructions with no
offset. Specifically, all load/stores with nonzero offsets would become:
lw r3, 30(r5) is changed into addi r1, r5, 30
lw r3, (r1)
Can you give a 4-stage pipeline for the new design? Scratch the pipeline
organization diagram.
(c) Does the new design still require a "forwarding unit" or a "stall detection unit"
respectively? Why or why not?
(d) Referring to Question (a), please give the effects of the three terms due to the
new design in (b) and briefly explain your arguments.

202
Answer:
(a)
CPU time = Seconds/Program
         = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)
         = Instruction count × CPI × Clock cycle time

Term              Impact factors
Instruction count Algorithm, programming language, compiler, instruction
                  set architecture
CPI               Algorithm, programming language, compiler, instruction
                  set architecture, computer organization
Clock cycle time  Instruction set architecture, computer organization, VLSI
                  technology

(b)
IF ID EX/MEM WB

IM Reg DM Reg

(c) The new pipeline structure still needs forwarding unit because data hazard
would happen if two consecutive instructions have data dependency. But it is
not necessary to keep the stall detection unit since memory access can be
done in the third stage and no stall is needed when load-use hazard occurs.
That is, the load-use hazard can be resolved just by forwarding unit.
(d) Instruction count will increase since one memory access instruction in 5-stage
pipeline requires two instructions in the new 4-stage pipeline structure.
CPI may decrease, since the impact of hazards decreases.
The clock rate may decrease, since the third stage has more work to do; its
lengthened latency leads to a longer clock cycle time.

2. For a typical processor with five-stage pipeline, the following code is executed.
I1: add $5, $1, $3
I2: sub $1, $5, $4
I3: lw $3, 10($2)
I4: and $2, $5, $6
I5: bne $7, $8, Label
I6: add $9, $10, $11
I7: add $9, $11, $12
Label: lw $13, 25($14)
where In indicates the nth instruction in this code.
(a) Classify all true dependencies in the above code in terms of In.
(b) Identify all types of hazards in the above code in terms of In.
(c) Give the methods or techniques to resolve the hazards in (b).

Answer:
(a) (I1 and I2), (I1 and I4)
(b) Data hazard: (I1 and I2)
Control hazard: I5
(c)
Type Solutions
Software solution: (1)使用compiler插入足夠的no operation (nop)指
令。(2)重排指令的順序使data hazard情形消失。
Data
Hardware solution: (1)使用Forwarding前饋資料給後續有data hazard
hazard
的指令。(2)若是遇到load-use這種無法單用Forwarding解決的data
hazard時則頇先暫停(stall)一個時脈週期後再使用Forwarding
Software solution: (1)使用compiler插入足夠的no operation(nop)指
令。(2)Delay branch: 使用compiler於branch指令後插入安全指令。
Control Hardware solution: (1)將分支判斷提前以減少發生control hazard時所
hazard 需要清除指令的個數。(2)使用預測方法(static or dynamic)。當預測
正確時pipeline便可全速運作其效能就不會因為分支指令而降低。當
預測不正確時我們才需要清除掉pipeline中擷取錯誤的指令。

95 年中正資工 (CCU CSIE, 2006)

I. Short-answer questions: answer each question concisely; for each statement,
correct it and point out the key error.
1. For floating operations, (A + B) + C is equal to A + (B + C).
2. A 1GHz RISC CPU is faster than a 1 GHz CISC CPU, because the CPI of RISC
is smaller than that of CISC.
3. The law of performance indicates that CPU time can be shown as a product of 3
terms. What are those three terms? Explain what factors may have impact on the
three terms respectively.
4. What are advantages and side effects with increasing the block size for a given
cache size?
5. How many bits are required to store the ROM entries for a ROM with m-bit
inputs and k-bit output?
6. When the equality is true, why is the instruction "beq r1, r2, imm16" executed as:
PC ← PC + 4 + sign_extension(Imm16)||00b?

Answer:
1. Wrong; floating-point addition is not associative. That is, (A + B) + C ≠ A +
(B + C).
2. Wrong; this statement does not consider the instruction capability of the two
CPUs (one CISC instruction may do the work of several RISC instructions).
3.
Term              Factors
Instruction count Algorithm, programming language, compiler, instruction
                  set architecture
CPI               Algorithm, programming language, compiler, instruction
                  set architecture, organization
Clock rate        Organization, technology
4. Increasing the block size decreases the miss rate due to spatial locality, but a
very large block size can increase the miss rate. Increasing the block size also
increases the cache miss penalty.
5. 2^m × k bits
6. The beq target address is PC-relative: the 16-bit immediate is sign-extended
to 32 bits [sign_extension(Imm16)], shifted left by two bits
[sign_extension(Imm16)||00b], and added to the incremented PC
[PC + 4 + sign_extension(Imm16)||00b].

II. 問答題
1. The pipelining is a key technique to improve the performance of CPU. It allows
that multiple instructions are overlapped in execution so that CPU can take less
time to finish the execution of an application. To implement the pipeline, we can
use single-cycle or multicycle approach. Please answer the following questions in
brief.
(a) Compared to single-cycle approach, what are the advantages and
disadvantages of the multicycle approach.
(b) Give the designing principles of the multicycle approach.
(c) Write the impact of pipeline using the multicycle approach on clock rate, CPI
(clock cycles per instruction), instruction count, and branch miss penalty.

Answer:
(a) In the single-cycle approach each pipeline stage completes in one clock,
whereas in the multicycle approach a stage may take more than one clock.
Advantage: it is easier to balance the execution time among stages.
Disadvantage: more registers are required to store the data produced in each
clock.
(b) Balance the jobs that should be done in each clock
Balance the jobs that should be done in each stage
(c) Clock rate: increases
CPI: increases
Instruction count: may increase (branch penalties increase, and more NOP
instructions may be needed to resolve hazards)
Branch miss penalty: increases

2. Given a static RAM and its operation in the following figures with the following
pin definitions:
• CE': The chip enable input, which is active low. When CE' = 1,
the SRAM's data pins are disabled, and when CE' = 0, the data
pins are enabled.
• R/W: The control signal indicating whether the current operation
is a read (i.e. R/W' = 1) or a write (R/W' = 0). Read and write are
normally specified relative to the CPU, so read means reading
from RAM and write means writing to RAM.
• Adrs: Specifying the address for the read or write.
• Data: Denoting a bi-directional bundle of signals for data transfer.
When R/W' = 1, the pins are output, and when R/W' = 0, the data
pins are inputs.
(a) Please design a 2M  16-bit SRAM system built from the 1M  8-bit
SRAM as shown below.
(b) Please illustrate the timing diagram of your design for the new memory
system.

[Figure: the 1M × 8 SRAM block with CE', R/W', Adrs, and Data pins, and its
timing — for a read, CE' low with R/W' = 1 and a stable address drives data out
of the SRAM; for a write, CE' low with R/W' = 0 latches the CPU's data into the
addressed location.]

Answer:
(a) Use four 1M × 8 SRAM chips, M1-M4. Address bits 0-19 and the R/W'
signal are wired to all four chips. Address bit 20 selects the bank: when
Adrs20 = 0 it asserts CE' of M1 and M2, and when Adrs20 = 1 it asserts CE'
of M3 and M4 (through an inverter). Within each bank, one chip drives data
bits 0-7 and the other drives data bits 8-15, so the enabled pair together
supplies a 16-bit word. This gives 2M addressable 16-bit words.
(b) The timing is the same as for a single SRAM, except that Adrs20 selects
which bank responds. With Adrs20 = 0, a read (R/W' = 1) drives 16-bit data
from M1/M2 onto the bus and a write (R/W' = 0) stores the CPU's 16-bit data
into M1/M2; with Adrs20 = 1, the same read and write cycles address M3/M4.

94 年中正資工 (CCU CSIE, 2005)

I. Correction questions: point out the key error and explain it in one or two
sentences (answers without an explanation receive no credit).
1. In comparison with one's complement, the major advantage of two's complement
is that for an algorithm it usually does fewer multiplications.
2. The critical path of the ripple carry adder is linearly proportional to the width of
the adder, while the critical path of the carry look-ahead adder (CLA) is
independent of the width of the adder.
3. The communication schemes between CPU and peripherals include polling,
interrupt, DMA, and write-through.
4. In the design of bus, the decoder is used to decide which bus master has the bus
ownership.
5. The major difference between the hardwired control unit and the
microprogramming control unit is that the former needs the support of program
counter.
6. The Booth's algorithm is faster when doing multiplications since it combines the
product and multiplier registers.
7. DMA (Direct Memory Access) can be used to improve the performance of CPU
by direct load/store instructions.
8. In coding assembly, an integer multiply by a power of 2 can be replaced by a left
shift, and an integer division by a power of 2 can be replaced by a right shift.
9. Amdahl's law is a rule stating that the performance enhancement from a given
improvement is unlimited, so that the performance of a chip can be doubled
every two years.
10. Hierarchical caches (such as 2nd-level cache) are aimed at reducing average hit
time.
11. Since the memory is very cheap today, horizontal microinstruction or VLIW can
offer a cost-performance solution for embedded systems.
12. Widely variable instruction lengths (such as X86) still can give a deep pipeline
design as each stage can be balanced by reading a single instruction byte.
13. In general, the lookup of TLB can be in parallel with accessing the first level
cache.
14. PCI bus can operate up to 133MHz so that it can support fast memory transfer of
DDR memory in the north bridge.
15. In the SOC design for embedded systems, we can use popular CPU such as
Pentium 4 for the processor as well as the AGP bus for the on-chip bus to
integrate as a new system.

Answer:
1. A 1's-complement operation must be corrected with an end-around carry after the computation, whereas a 2's-complement operation needs no such correction.
2. The critical path of the CLA is not independent of the width of the adder
3. Does not include write-through

4. The arbiter is used to decide which bus master has the bus ownership
5. The microprogramming control unit needs the support of program counter
6. Booth's algorithm speeds up multiplication by reducing the number of additions an ordinary multiplier performs, not by combining the product and multiplier registers.
7. Direct I/O operations.
8. Only unsigned integer division by a power of 2 can be replaced by a right
shift.
9. The performance enhancement possible with a given improvement is limited by the fraction of time the improved feature is used.
10. Multilevel caches are used to reduce the miss penalty.
11. Embedded systems require small memories to reduce memory access time for real-time computation.
12. Reading a single instruction byte cannot balance the work done by each stage.
13. In general, the TLB must be looked up before the first-level cache is accessed.
14. PCI is connected to the south-bridge chip.
15. The AGP bus is used to connect the Pentium 4 to the outside world and cannot be used as an on-chip bus.
II. Problems

1. The Chung Cheng Computer Company (CCC) has announced two versions of
CPUs, CCU1 and CCU2, for BARM Inc. (Better than ARM).
(1) CCU1 running at 100MHz has the instruction fractions and cycles as follows.
Give the CPI and MIPS rate
Instruction Class ALU lw/store branch
Frequency 50% 30% 20%
CPI of instruction 1 2 3
(2) CCU1 looks like a normal MIPS CPU with the below pipeline. Explain how
data-forwarding (or called bypassing) can be used to reduce the effects of load
delays and why we cannot eliminate them completely.
instruction fetch → decode/register fetch → execute → memory access → register WB
(3) Now in CCU2 we add a new stage by allowing the second operand to be
shifted by an arbitrary amount before ALU computation as follows. Give a
possible data path diagram to support the new design.
instruction fetch → decode/register fetch → shift/rotate operand 2 → execute → memory access → register WB
(4) With CCU2 running at 150MHz, if half of ALU instructions can be merged
into the shift operations and CPI of instructions are the same, please give CPI
and MIPS rate for CCU2. What is the speedup over CCU1?
(5) What data hazards may occur in CCU2 and how could they be resolved in the
above pipeline?

Answer:
(1) CPI = 1 × 0.5 + 2 × 0.3 + 3 × 0.2 = 1.7
MIPS = (100 × 10^6) / (1.7 × 10^6) = 58.82
(2) Suppose CCU1 executes the two instructions lw $t0, 0($s1) and add $t2, $t0, $t1. During clock 4 the data is still being read from memory by the load instruction while the ALU in stage 3 is already performing the operation for the following instruction, so forwarding alone cannot completely remove the load delay. If we stall one clock cycle between the load and the following instruction and then apply forwarding, the load delay is eliminated completely.
(3) The new datapath inserts a shift/rotate stage between register fetch and the ALU: IM → Reg → shift/rotate → ALU → DM → Reg (datapath diagram omitted).

(4) CPI = (1 × 0.25 + 2 × 0.3 + 3 × 0.2) / (0.25 + 0.3 + 0.2) = 1.93
MIPS = (150 × 10^6) / (1.93 × 10^6) = 77.72
Speedup = (1.7 × 10 ns) / (0.75 × 1.93 × 6.67 ns) = 1.76
(5) In addition to the EX and MEM hazards, there are shift/rotate hazards in CCU2. The forwarding unit can be extended to forward results into the shift/rotate stage as well, along the datapath IM → Reg → shift/rotate → ALU → DM → Reg (forwarding-path diagram omitted).
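The CPI, MIPS, and speedup arithmetic in parts (1) and (4) can be sanity-checked with a short script; the helper names below are illustrative, not part of the exam:

```python
def weighted_cpi(mix):
    """mix: list of (frequency, cpi) pairs; frequencies need not sum to 1."""
    total_freq = sum(f for f, _ in mix)
    return sum(f * c for f, c in mix) / total_freq

# CCU1: 50% ALU (CPI 1), 30% lw/store (CPI 2), 20% branch (CPI 3) at 100 MHz
cpi1 = weighted_cpi([(0.50, 1), (0.30, 2), (0.20, 3)])
mips1 = 100e6 / (cpi1 * 1e6)

# CCU2: half of the ALU instructions merge into shifts, leaving a mix of
# 25% ALU / 30% lw-store / 20% branch (0.75 of the original count) at 150 MHz
cpi2 = weighted_cpi([(0.25, 1), (0.30, 2), (0.20, 3)])
mips2 = 150e6 / (cpi2 * 1e6)

# Speedup compares the time per ORIGINAL instruction on each machine
time_ccu1 = cpi1 * 10.0                       # 10 ns cycle at 100 MHz
time_ccu2 = 0.75 * cpi2 * (1000 / 150)        # ~6.67 ns cycle, 0.75x count
speedup = time_ccu1 / time_ccu2
print(round(cpi1, 2), round(cpi2, 2), round(mips1, 2), round(speedup, 2))
```

Running it reproduces CPI 1.7 and 1.93, MIPS 58.82, and a speedup of about 1.76.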

2. One of the most important aspects of computer design is instruction set


architecture (ISA) design because it affects many aspects of the computer system
including implementation of CPU and compiler.
(1) Give the useful information that must be encoded in an instruction set (ISA).
(2) What are the principles of designing the instruction set?

Answer:

(1) Memory
Registers
Instruction format
Addressing mode
(2) Simplicity favors regularity
Smaller is faster
Good design demands good compromises
Make the common case fast

3. The choice of associativity in memory hierarchy depends on the time cost of


cache miss and the hardware cost of implementation.
(1) Give the advantages and disadvantages of caches moving from direct-mapped
to set-associative caches.
(2) For a cache of 16KB with 16B per block and 32b input address, please
compute the total tag bits respectively required for caches with direct-mapped,
4-way set associative, and fully associative.
(3) Compare the different considerations for the choice of associativity in
designing caches, TLB, and virtual memory.

Answer:
(1) Advantage: decrease miss rate
Disadvantage: increase hit time and hardware overhead
(2) Direct-mapped:
offset = 4 bits, number of blocks = 16KB/16B = 1K = 2^10 → tag field = 32 − 4 − 10 = 18 bits
Total tag bits = 1K × 18 = 18 Kbits
4-way set associative:
number of sets = 1K/4 = 256 = 2^8 → tag field = 32 − 4 − 8 = 20 bits
Total tag bits = 256 × 4 × 20 = 20 Kbits
Fully associative:
tag field = 32 − 4 = 28 bits
Total tag bits = 1K × 28 = 28 Kbits
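The three tag-storage figures can be recomputed with a small helper (an illustrative sketch, not exam material):

```python
def total_tag_bits(cache_bytes, block_bytes, addr_bits, ways):
    """Total tag storage for a cache: one tag per block."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset = (block_bytes - 1).bit_length()              # log2(block size)
    index = (sets - 1).bit_length() if sets > 1 else 0   # log2(number of sets)
    tag = addr_bits - offset - index
    return blocks * tag

KB = 1024
direct = total_tag_bits(16 * KB, 16, 32, ways=1)
four_way = total_tag_bits(16 * KB, 16, 32, ways=4)
fully = total_tag_bits(16 * KB, 16, 32, ways=16 * KB // 16)  # all blocks in one set
print(direct, four_way, fully)   # 18, 20, and 28 Kbits, expressed in bits
```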
(3)
Cache: because the cache miss penalty is not very high, direct-mapped, set-associative, and fully associative organizations can all be used; the choice depends mainly on the hit-time requirement.
TLB: the TLB is a cache of the page table, so as with caches all three organizations can be used; however, TLBs are usually very small, so they are generally fully associative.
Virtual memory: the page-fault penalty is so high that fully associative placement must be used to minimize the miss rate.

National Chung Cheng University CSIE, ROC Year 93 (2004)

I. True/False questions: answer true or false and explain in 1–2 sentences (answers without an explanation receive no credit)


1. For a C program, the length of its binary code produced for a RISC is always
longer than that produced for a CISC.
2. As compared to 1's complement number system, the advantage of 2's
complement number system includes easy management of sign bit.
3. In the design of a pipelined control unit, it is possible to encounter the following
hazards: data hazards, memory hazards, control hazards, and structural hazards.
4. In an on-chip bus like AMBA, the arbiter is used to decide which bus master has
the bus ownership.
5. For float-point numbers, A + (B + C) is equal to (A + B) + C.
6. If the cache associativity is increased, the miss rate will decrease, the cost will
increase, and the access time will decrease.
7. In MIPS instruction set, a branch instruction with a 16-bit offset field can give the
maximal distance of the target address with 216-1 bytes from the current PC
address.
8. The reasons that the PLA can be more efficient than ROM for control unit are (1)
PLA has no duplicate entries and (2) PLA is easier to decode the address.
9. Increasing the depth of pipelining may not always improve performance, mainly
because of larger memory access time.
10. Instruction set architecture may have significant impact on compiler design.

Answer:
1. True. RISC provides only simple instructions, whereas CISC also provides more powerful instructions, so a C program usually compiles to more instructions on a RISC.
2. False; 2's-complement representation has no separately managed sign bit.
3. False; memory hazards are not among them (only data, control, and structural hazards).
4. True
5. False; floating-point addition does not satisfy the associative law.
6. False; the access time will increase.
7. False; the reach is about ±2^15 words, i.e. ±2^17 bytes, from the current PC address.
8. False; statement (2) is incorrect. In addition to statement (1), a PLA can also share product terms and take don't-cares into account.
9. False; the main reason is the penalty of hazards.
10. True, because the output of a compiler is a sequence of instructions from the ISA.

II. Problems
1. The Chung Cheng Computer Company (CCC) has designed two versions of
CPUs for Outel Inc. (named CCU1 and CCU2, "CCU outside™"), which run
at 100 MHz and 200 MHz respectively. The average number of cycles and
frequency for each instruction class are as follows:
Instruction Cycles of Cycles of
Frequency
Class CCU1 CCU2
A 50% 1 2
B 30% 2 3
C 20% 3 3
(1) Which CPU is faster when executing the same program? What is speedup?
(2) What is "MIPS"? Compute the MIPS rating of CCU1 and CCU2 respectively.
(3) Give all possible techniques why they make CCU2 running at a faster clock if
the instruction set architecture is not changed.
(4) If CCC claims that CCU1 is a 32-bit CPU and CCU2 is a 64-bit CPU, how do
you define the difference between a 32-bit CPU and a 64-bit CPU?

Answer:
(1) CPIccu1 = 1 × 0.5 + 2 × 0.3 + 3 × 0.2 = 1.7
CPIccu2 = 2 × 0.5 + 3 × 0.3 + 3 × 0.2 = 2.5
Suppose IC represents the instruction count then
ExTimeccu1 = (1.7 × IC) / (100 × 10^6) = (17 × IC) ns
ExTimeccu2 = (2.5 × IC) / (200 × 10^6) = (12.5 × IC) ns. Hence, CCU2 is faster.
Speedup = ExTimeccu1 / ExTimeccu2 = (17 × IC) / (12.5 × IC) = 1.36
(2) MIPS: a measure of program execution speed based on the number of millions of instructions executed per second
MIPSccu1 = (100 × 10^6) / (1.7 × 10^6) = 58.82
MIPSccu2 = (200 × 10^6) / (2.5 × 10^6) = 80
(3) Advance VLSI technology, faster components, advance computer
organization
(4) "32-bit CPU" or "64-bit CPU" refers to the amount of data the CPU can process at one time, 32 or 64 bits, which can be determined from the width of its registers.

2. Now CCC has designed another simple version of CPU (named CCU0), which
does not support interrupts. Argue which of the following design techniques are
not possible for CCU0. Why?
(1) Pipelining
(2) Virtual memory
(3) Polling I/O
(4) Data forwarding
(5) Cache memory

Answer:

(1) Possible: pipelining is unrelated to interrupts.
(2) Impossible: on a page fault the CPU must be interrupted so that the operating system can move the page from the hard disk into main memory.
(3) Possible: polling does not require interrupting the CPU.
(4) Possible: forwarding does not cause any CPU interrupt.
(5) Possible: cache-miss handling is done with the processor control unit and a separate controller that initiates the memory access and refills the cache, so a cache miss only stalls the CPU instead of interrupting it.

3. The following figure shows the cache miss rate versus block size for five
different size caches in a memory system. Assume the memory system takes 40
clock cycles of overhead and then deliver 16 bytes every 2 clock cycles. Assume
a cache hit takes 1 cycle.
(1) Give the access time to fetch a data block for each block size
(2) Give which block size has the lowest average memory access time for 4K,
and 64K
(3) Give your observations from the figure. Explain why the lowest miss rate
occurs at different block size for different cache size
Cache size
Block size 1K 4K 16K 64K 256K
16 15.0% 8.5% 4.0% 2.0% 1.0%
32 13.0% 7.0% 3.0% 1.5% 0.7%
64 13.5% 7.0% 2.5% 1.0% 0.5%
128 17.0% 8.0% 3.0% 1.0% 0.5%
256 22.0% 9.5% 3.5% 1.2% 0.5%

Answer: (1)(2)
Block Block transfer Average memory access time (cycles)
size time (cycles) 4K Cache 64K Cache
16 40 + 1 × 2 = 42 1 + 42 × 0.085 = 4.57 1 + 42 × 0.02 = 1.84
32 40 + 2 × 2 = 44 1 + 44 × 0.07 = 4.08 1 + 44 × 0.015 = 1.66
64 40 + 4 × 2 = 48 1 + 48 × 0.07 = 4.36 1 + 48 × 0.01 = 1.48
128 40 + 8 × 2 = 56 1 + 56 × 0.08 = 5.48 1 + 56 × 0.01 = 1.56
256 40 + 16 × 2 = 72 1 + 72 × 0.095 = 7.84 1 + 72 × 0.012 = 1.864
(2) For 4K cache, block size 32 has the smallest AMAT
For 64K cache, block size 64 has the smallest AMAT
(3) A larger block size exploits the program's spatial locality, so the miss rate is lower. But an overly large block size raises conflicts between blocks and increases the miss rate again. The larger the cache, the smaller the impact of the misses caused by an overly large block size, which is why the best block size grows with the cache size.
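The table in (1)/(2) follows from AMAT = hit time + miss rate × miss penalty, with a penalty of 40 cycles plus 2 cycles per 16 bytes; a small script (miss rates copied from the question's table) reproduces the best block sizes:

```python
def miss_penalty(block_bytes):
    return 40 + 2 * (block_bytes // 16)   # overhead + 2 cycles per 16 bytes

miss_rate = {   # block size -> {cache size: miss rate}, from the question
    16:  {'4K': 0.085, '64K': 0.020},
    32:  {'4K': 0.070, '64K': 0.015},
    64:  {'4K': 0.070, '64K': 0.010},
    128: {'4K': 0.080, '64K': 0.010},
    256: {'4K': 0.095, '64K': 0.012},
}

def amat(block, cache):
    return 1 + miss_penalty(block) * miss_rate[block][cache]

best_4k = min(miss_rate, key=lambda b: amat(b, '4K'))
best_64k = min(miss_rate, key=lambda b: amat(b, '64K'))
print(best_4k, best_64k)   # 32 for the 4K cache, 64 for the 64K cache
```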

4. Pipelining- consider a 5-stage pipeline like MIPS
(1) Explain why instruction work at each stage should be as balanced as possible.
Give an example to support your arguments.
(2) How to solve the data hazard due to an immediate use on a previous data load
from memory? Does it still require a stall? Why?
(3) The finish time of branch instructions can be moved early from MEM to ID.
What are the costs behind that? Is it possible to move earlier to the IF stage?

Answer:
(1) The clock cycle time of a pipeline is determined by the stage with the longest execution time. If the work cannot be balanced across the stages, the differences between stages become large, the shorter stages waste time every cycle, and pipeline performance suffers.
For example, suppose a class of instructions takes 8 ns to execute and we design two 4-stage pipelined machines for it. Machine 1's four stages take 1 ns, 5 ns, 1 ns, and 1 ns; machine 2's four stages take 2 ns each. Machine 1's clock rate is then 200 MHz while machine 2's is 500 MHz. Executing 100 instructions on each machine:
Execution time for machine 1 = ((4 − 1) + 100) × 5 ns = 515 ns
Execution time for machine 2 = ((4 − 1) + 100) × 2 ns = 206 ns
So unbalanced stages lower the clock rate and degrade pipeline performance.
(2) Stall one clock cycle and then resolve the hazard by forwarding. Because in stage 4 the data is still being read from memory by the load while the ALU in stage 3 is already performing the operation for the following instruction, forwarding alone cannot completely remove the load delay. If we stall one clock cycle between the load and the following instruction and then apply forwarding, the load delay is eliminated completely.
(3) Once the two register values have been read from the register file, a 32-bit XOR array can compare them for equality, so the cost is 32 extra XOR gates and 1 NOR gate.
The comparison cannot move to the IF stage because the register values have not yet been read there, so no equality test is possible.
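The machine 1 / machine 2 comparison above can be reproduced with a small helper (names are illustrative):

```python
def pipeline_time_ns(stage_times_ns, n_instructions):
    """Execution time = (fill cycles + instruction count) * cycle time,
    where the cycle time is set by the slowest stage."""
    cycle = max(stage_times_ns)
    fill = len(stage_times_ns) - 1
    return (fill + n_instructions) * cycle

m1 = pipeline_time_ns([1, 5, 1, 1], 100)   # unbalanced stages
m2 = pipeline_time_ns([2, 2, 2, 2], 100)   # balanced stages
print(m1, m2)   # 515 and 206 ns
```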

National Chung Cheng University CSIE, ROC Year 92 (2003)

I. True/False questions: answer true or false and briefly explain your reasoning
1. MIPS (million instructions per second) is not a good metric to measure the
performance of two processors with the cycle time, which is impractical for real
applications.
2. The decimal value of a hexadecimal fixed-point number, 0x1F.3 is 31.3.
3. In implementing pipelining for branch instructions, the adder for computing target
addresses at the ID stage can be eliminated because the ALU can be free in the
decoding phase.
4. The reasons that the PLA can be more efficient than ROM for control unit are (1)
PLA has no duplicate entries and (2) PLA is easier to decode the address.
5. Cache miss rate is dependent on both the block size and cache associativity. The
cache miss rate decreases as the block size increases.
6. There are two ways of writing data in a hierarchical memory with cache:
write-through and write-back. The write-through policy costs less than write-back
because the write-through writes only a single word rather than a cache block.
7. For a given size of L1 cache, using the multi-level caches can reduce the average
miss penalty instead of the miss rate of L1 cache.
8. A daisy chain bus uses a bus grant line that chains through each device from
lowest to highest priority.
9. There are three ways in interfacing processors and peripherals: polling, interrupt,
and DMA. Among them, the polling I/O consumes the most amount of processor
time.
10. If a processor does not support precise interrupt, the virtual memory is not
possible as the OS has no way to resume the execution.

Answer:
1. True
2. False, should be 31.1875
3. False, the adder of computing target address at the ID stage can not be eliminated
4. False, no address decode is required for PLA
5. False, too large block size increase miss rate
6. False, write-through policy costs more than write-back does
7. True
8. False, from highest to lowest priority
9. True
10. True

II. Problems
1. Use 4-bit carry lookahead adders to design a 16-bit carry lookahead adder.
(1) Give the block diagram of the 16-bit adder.
(2) Describe how carry signals, generate signals, and propagate signals are passed
between 4-bit carry lookahead adders.
(3) Assume that each 4-bit carry lookahead adder takes d time units to generate an
output carry after it receives the input carry. What is the total addition time of
this 16-bit carry lookahead adder?

Answer:
(1) [Block diagram: four 4-bit CLA slices ALU0–ALU3 handle bits a0–a15/b0–b15 and produce Result0–Result15; each slice sends its block propagate Pi and generate Gi to a carry-lookahead unit, which returns the block carries C1–C4; CarryIn = c0, CarryOut = C4. Figure omitted]
(2) Propagate signals:
P0 = p3·p2·p1·p0
P1 = p7·p6·p5·p4
P2 = p11·p10·p9·p8
P3 = p15·p14·p13·p12
Generate signals:
G0 = g3 + (p3·g2) + (p3·p2·g1) + (p3·p2·p1·g0)
G1 = g7 + (p7·g6) + (p7·p6·g5) + (p7·p6·p5·g4)
G2 = g11 + (p11·g10) + (p11·p10·g9) + (p11·p10·p9·g8)
G3 = g15 + (p15·g14) + (p15·p14·g13) + (p15·p14·p13·g12)
Carry signals:
C1 = G0 + c0·P0
C2 = G1 + G0·P1 + c0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + c0·P0·P1·P2
C4 = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 + c0·P0·P1·P2·P3

(3) Producing each pi and gi takes 1 gate delay. The first-level carries, computed from pi and gi, take 2 more gate delays, and the second-level carries, computed from Pi and Gi, take another 2 gate delays; since Pi and Gi can be computed in parallel with the first-level carries, the second-level carries are ready after 1 + 2 + 2 = 5 gate delays. Each second-level carry then enters a 4-bit carry-lookahead adder, so the total delay time = 5 gate delays + d + 3 gate delays = 8 gate delays + d.
Note: d is in fact 2 gate delays, so the answer can also be written as 5d.
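The block P/G/C equations in (2) can be brute-force checked against ordinary addition with a short simulation (an illustrative sketch, not exam material):

```python
def cla16(a, b, c0=0):
    p = [((a >> i) | (b >> i)) & 1 for i in range(16)]   # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(16)]   # bit generate
    # Block propagate/generate for the four 4-bit groups
    P = [p[4*k] & p[4*k+1] & p[4*k+2] & p[4*k+3] for k in range(4)]
    G = [g[4*k+3] | p[4*k+3] & g[4*k+2]
         | p[4*k+3] & p[4*k+2] & g[4*k+1]
         | p[4*k+3] & p[4*k+2] & p[4*k+1] & g[4*k] for k in range(4)]
    C = [c0]
    for k in range(4):                  # C(k+1) = Gk + Pk * Ck
        C.append(G[k] | (P[k] & C[k]))
    total, carry = 0, c0
    for i in range(16):                 # ripple inside each 4-bit block
        if i % 4 == 0:
            carry = C[i // 4]           # block carry-in from the lookahead unit
        total |= (((a >> i) ^ (b >> i) ^ carry) & 1) << i
        carry = g[i] | (p[i] & carry)
    return total, C[4]                  # 16-bit sum and carry-out

for a, b in [(0, 0), (0xFFFF, 1), (0x1234, 0xCDEF), (0xABCD, 0x5432)]:
    s, cout = cla16(a, b)
    assert s == (a + b) & 0xFFFF and cout == (a + b) >> 16
```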

2. Interrupt:
(1) Describe the differences between interrupt, exception, and trap.
(2) Write the detailed procedure of I/O devices with interrupt mechanism step by
step and show how these steps are performed via CPU, operating systems, or
devices.

Answer:
(1) Interrupt: an event raised by hardware outside the processor that prevents the CPU from proceeding to the next instruction.
Exception: an unexpected result of executing an instruction, such as divide by zero, overflow, or an undefined opcode, that prevents the CPU from proceeding to the next instruction (the event comes from inside the processor).
Trap: a transfer of program control caused by executing a system-call instruction (also from inside the processor).
(2)
1. When an I/O device needs service, its device controller sends an interrupt signal to the CPU.
2. The CPU saves the current program counter and, according to the interrupt source, jumps to the appropriate address in the operating system.
3. The operating system saves the state of the currently running process.
4. The CPU executes the operating system's interrupt service routine.
5. When the service routine completes, the operating system restores the state of the interrupted process and, using the saved program counter, jumps back to resume its execution.

3. Pipelining:
We plan to construct a new CPU: CCU (creative computing unit). Assume the
operation times of the major components are: Mem = 4 ns, ALU = 2 ns, ID & register
read = 2 ns, and register WB = 1 ns. The instruction mix for applications is 25%
loads, 10% stores, 50% R-format operations, and 15% branches. Consider an
organization with 5-stage pipelining:
(1) Determine max clock rates for 3 implementations of single-cycle,
multiple-cycle, and ideal pipeline (no hazard, no cache stalls) respectively.
(2) Compute the CPI for three implementations and their performance in term of
the Average Execution Time per Instruction. (TPI = CPI * cycle time).
(3) Considering the pipelining implementation, actually we have 3 cycle stall for
a branch. If we invent a way to reduce the stall with only 1 cycle, compute the
speedup obtained for such an invention.
(4) In fact Mem of 4 ns is only estimated. If we build a cache for the CCU, the
cache access time can be 2 ns, that is , Mem needs only 2 ns instead of 4 ns.
However, we need to stall and pay a memory latency penalty of 40 ns for
those 5% cache misses. How many stall cycles does the new pipelined CCU
have? Also compute the TPI for the new CPU.

Answer:

(1)
Machine single-cycle multiple-cycle pipeline
Cycle time 4 + 2 + 2 + 4 + 1 = 13 ns 4 ns 4 ns
Clock rate 76.92 MHz 250 MHz 250 MHz
(2)
Machine single-cycle multiple-cycle pipeline
CPI 1 5×0.25+4×0.1+4×0.5+3×0.15 = 4.1 1
TPI 1×13 = 13 ns 4.1×4= 16.4 ns 1×4 = 4 ns

(3) CPI before the improvement = 1 + 0.15 × 3 = 1.45
CPI after the improvement = 1 + 0.15 × 1 = 1.15
Speedup = execution time before / execution time after
= CPI before / CPI after = 1.45 / 1.15 = 1.26
(4) (a) miss penalty = 40 ns / 2 ns = 20
memory stall cycle per instruction = (1 + 0.25 + 0.1) × 0.05 × 20 = 1.35
(b) new CPI = 1 + 1.35 = 2.35
new TPI = 2.35 × 2 ns = 4.7 ns
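The whole comparison, including the cache-stall computation in (4), can be reproduced in a few lines (names are illustrative):

```python
# Single-cycle: one long cycle covering IF + ID + EX + MEM + WB
single_cycle_ns = 4 + 2 + 2 + 4 + 1      # 13 ns
single_tpi = 1 * single_cycle_ns         # CPI = 1

# Multicycle: cycle time = slowest stage = 4 ns; loads take 5 cycles,
# stores and R-format 4, branches 3
multi_cpi = 5 * 0.25 + 4 * 0.10 + 4 * 0.50 + 3 * 0.15   # 4.1
multi_tpi = multi_cpi * 4                # 16.4 ns

# Ideal pipeline: CPI = 1 at the 4 ns stage time
pipe_tpi = 1 * 4

# (4) 2 ns cache cycle, 40 ns miss penalty = 20 cycles, 5% miss rate,
# 1 instruction access + 0.35 data accesses per instruction
stall_cycles = (1 + 0.25 + 0.10) * 0.05 * 20   # 1.35
new_tpi = (1 + stall_cycles) * 2               # 4.7 ns

print(single_tpi, multi_tpi, pipe_tpi, new_tpi)
```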

4. Performance analysis of two bus schemes:


Suppose we have a system with the following characteristics:
(1) A memory and bus system supporting block access of 4 to 16 32-bit words.
(2) A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer
taking 1 clock cycle. And 1 clock cycle required to send an address to
memory.
(3) Two clock cycles needed between each bus operation. (Assume the bus is idle
before an access.)
(4) A memory access time for the first four words of 200 ns; each additional set of
four words can be read in 20 ns. Assume that a bus transfer of the most
recently read data and a read of the next four words can be overlapped.
(a) Find the sustained bandwidth and the latency for a read of 256 words for transfers
that use 4-word blocks and for transfers that use 16-word blocks.
(b) Compute the effective number of bus transactions per second for each case.
Recall that a single bus transaction consists of an address transmission followed
by data.

Answer:
(a) For the 4-word block transfers, each block takes
1. 1 clock cycle to send the address to memory
2. 200 ns / (5 ns/cycle) = 40 clock cycles to read memory
3. 2 clock cycles to send the data from memory
4. 2 idle clock cycles between this transfer and the next
This is a total of 45 cycles. The bus bandwidth is (4 × 4) bytes × (1 s / (45 × 5 ns)) = 71.11 MB/s.
Each block takes 45 cycles and 256/4 = 64 transactions are needed, so the entire transfer takes 45 × 64 = 2880 clock cycles. Thus the latency is 2880 cycles × 5 ns/cycle = 14,400 ns.
For the 16-word block transfers, the first 4-word group of a block requires
1. 1 clock cycle to send an address to memory
2. 200 ns or 40 cycles to read the first four words in memory
3. 2 cycles to send the data of the group, during which time the read of the four words in the next group is started
4. 2 idle cycles between transfers, during which the read of the next group is completed
Each of the three remaining 4-word groups repeats only the last two steps. Thus, the total number of cycles for each 16-word block is 1 + 40 + 4 × (2 + 2) = 57 cycles. The bus bandwidth with 16-word blocks is (16 × 4) bytes × (1 s / (57 × 5 ns)) = 224.56 MB/s.
Each block takes 57 cycles and 256/16 = 16 transactions are needed, so the entire transfer takes 57 × 16 = 912 cycles. Thus the latency is 912 cycles × 5 ns/cycle = 4560 ns.
(b) For the 4-word block transfers:
The number of bus transactions per second is 64 transactions × (1 s / 14,400 ns) = 4.44M transactions/second.
For the 16-word block transfers:
The number of bus transactions per second with 16-word blocks is 16 transactions × (1 s / 4560 ns) = 3.51M transactions/second.
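The cycle counts, latency, and bandwidth for both block sizes follow the same formula and can be recomputed with a short script (illustrative names):

```python
CYCLE_NS = 5        # 200 MHz bus
TOTAL_WORDS = 256   # size of the whole read

def block_cycles(block_words):
    # 1 address cycle + 40 cycles for the first read + (2 transfer + 2 idle)
    # per 4-word group; later reads overlap the previous transfer
    groups = block_words // 4
    return 1 + 40 + groups * (2 + 2)

def stats(block_words):
    transactions = TOTAL_WORDS // block_words
    latency_ns = block_cycles(block_words) * transactions * CYCLE_NS
    bandwidth_mb = TOTAL_WORDS * 4 * 1e3 / latency_ns   # MB per second
    return transactions, latency_ns, bandwidth_mb

print(stats(4))    # (64, 14400 ns, ~71.11 MB/s)
print(stats(16))   # (16, 4560 ns, ~224.56 MB/s)
```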

National Sun Yat-sen University EE, ROC Year 96 (2007)

1. Identify two differences between the following terminology pairs.


(1) Computer Networks vs Cluster Computers
(2) Multi-Core Server vs Multi-processor Server
(3) VLIW vs Superscalar
(4) Synchronous DRAM vs Cache DRAM
(5) TLB (Translation Lookaside Buffer) vs Page Table

Answer:
(1) A network is a collection of computers and devices connected to each other.
The network allows computers to communicate with each other and share
resources and information.
A computer cluster is a group of linked computers, working together closely
so that in many respects they form a single computer. The components of a
cluster are commonly connected to each other through fast local area
networks.
(2) A multi-core server is a computer that has two or more independent cores
(normally CPUs) packaged into a single IC.
A multiprocessor server is a computer that has two or more processors with
common access to a main memory.
(3) Both VLIW and superscalar machines can issue more than one instruction to the
execution units per cycle, but in a VLIW approach the compiler decides which
instructions can run in parallel, whereas in a superscalar approach the hardware
decides.
(4) Both synchronous DRAM and cache DRAM are used to improve DRAM
performance. A cache DRAM is a DRAM with an on-chip cache; that is, it
integrates an SRAM cache onto a generic DRAM chip. Typical DRAM is
asynchronous, whereas synchronous DRAM exchanges data with the processor
synchronized to an external clock signal, running at the full speed of the
processor/memory bus without imposing wait states.
(5) The TLB is a cache of the page table: the TLB is a small, dedicated hardware cache, while the page table is stored in main memory. A TLB entry has a tag field; a page-table entry does not.

2. Consider a hypothetical 32-bit microprocessor having 32-bit instructions
composed of two fields: the first byte contains the OP code and the remainder the
immediate operand or an operand address. Assume that the local address bus is 32
bits and the local data bus is 16 bits. No time multiplexing between the address
and data buses.
(1) What is the maximum directly addressable memory capacity (in bytes)?
(2) What is the minimum number of bits required for the program counter?
(3) Assuming the direct addressing mode is applied, how many address and data
bus cycles are required to fetch an instruction and its corresponding operand or
data from memory?

Answer:
Instruction format: OP code (8 bits) | address / immediate operand (24 bits)
(1) The maximum directly addressable memory capacity = 2^24 bytes = 16 MB
(2) The minimum number of bits required for the PC = min(24, 32) = 24 bits
(3)
Address bus cycle Data bus cycle
Instruction fetch 1 2
Operand fetch 1 2

3. Perform the following three Intel X86 instructions,


MOV AX 0248H
MOV BX 0564H
CMP AX BX
and list the Carry Flag(CF), Overflow Flag(OF), Parity Flag(PF), Sign Flag(SF),
and Zero Flag(ZF).

Answer:

0248H − 0564H = 0000001001001000₂ − 0000010101100100₂:

  0000001001001000
+ 1111101010011100 (2's complement of 0564H)
  1111110011100100 (= FCE4H)

Carry Flag Overflow Flag Parity Flag Sign Flag Zero Flag
1 0 1 1 0

CF = 1 because the unsigned subtraction borrows (0248H < 0564H, and x86 sets CF on borrow for CMP/SUB). PF = 1 because the low byte of the result, E4H = 11100100, contains an even number of 1 bits.

Note: instruction 1 moves the hexadecimal constant 0248 into register AX;
instruction 2 moves the hexadecimal constant 0564 into register BX;
instruction 3 compares the registers by subtracting BX from AX (AX − BX).
On x86, PF = 1 when the low byte of the result contains an even number of 1 bits, and PF = 0 otherwise.
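The flag settings can be checked by modeling the 16-bit CMP in a few lines (this follows standard x86 semantics; the helper names are mine):

```python
def cmp16_flags(ax, bx):
    """Flags set by a 16-bit x86 CMP AX, BX (which computes AX - BX)."""
    result = (ax - bx) & 0xFFFF
    cf = int(ax < bx)                                  # unsigned borrow
    pf = int(bin(result & 0xFF).count('1') % 2 == 0)   # low-byte even parity
    sf = (result >> 15) & 1                            # sign = bit 15
    zf = int(result == 0)
    signed = lambda x: x - 0x10000 if x & 0x8000 else x
    of = int(not -0x8000 <= signed(ax) - signed(bx) <= 0x7FFF)
    return {'CF': cf, 'OF': of, 'PF': pf, 'SF': sf, 'ZF': zf, 'result': result}

flags = cmp16_flags(0x0248, 0x0564)
print(hex(flags['result']), flags)   # result 0xFCE4
```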

4. Analysis of Program Structures.


(1) Analyze the following program, and find out how many times the statement
"sum++" is executed.
sum = 0;
for (i = 0; i < n; i++) {
    h = i + 1;
    for (j = 0; j < h * h; j++)
        sum++;
}
(2) Analyze the following program, and find out how many times the statement
"A(i, j, k)" is executed.
For k = 1 to n
    For i = 0 to k-1
        For j = 0 to k-1
            If i ≠ j then A(i, j, k)
        End
    End
End

Answer:
(1) For a given i, h = i + 1 and the inner loop executes h² times. Since i runs from 0 to n − 1, h runs from 1 to n, so "sum++" executes
Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6 times.
(2) For a given k, the two inner loops execute k × k − k = k² − k times (the k diagonal cases i = j are skipped). Since k runs from 1 to n, A(i, j, k) executes
Σ_{k=1}^{n} (k² − k) = n(n + 1)(2n + 1)/6 − n(n + 1)/2 = n(n + 1)(n − 1)/3 times.
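Both closed forms can be verified by brute-force execution of the loops (an illustrative check):

```python
def count_sum_pp(n):
    """Number of times sum++ runs in program (1)."""
    count = 0
    for i in range(n):            # h = i + 1 runs from 1 to n
        h = i + 1
        for _ in range(h * h):
            count += 1
    return count

def count_A(n):
    """Number of times A(i, j, k) runs in program (2)."""
    count = 0
    for k in range(1, n + 1):
        for i in range(k):
            for j in range(k):
                if i != j:
                    count += 1
    return count

n = 10
assert count_sum_pp(n) == n * (n + 1) * (2 * n + 1) // 6
assert count_A(n) == n * (n + 1) * (n - 1) // 3
```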

5. Hamming error correction codes.


(1) How many check-bits are needed if the Hamming error correction code is
used to detect single bit errors in a 1024-bit data word?
(2) For the 8-bit word 00111001, the check bits stored with it would be 0111.
Suppose when the word is read from memory, the check bits are calculated to
be 1101. What is the data word that was read from memory?

Answer:
(1) 2^k − 1 ≥ 1024 + k → k = 11
(2)
Position: 12 11 10  9  8  7  6  5  4  3  2  1
Bit:      M8 M7 M6 M5 C8 M4 M3 M2 C4 M1 C2 C1
Value:     0  0  1  1  0  1  0  0  1  1  1  1
C1 = M1⊕M2⊕M4⊕M5⊕M7 = 1⊕0⊕1⊕1⊕0 = 1
C2 = M1⊕M3⊕M4⊕M6⊕M7 = 1⊕0⊕1⊕1⊕0 = 1
C4 = M2⊕M3⊕M4⊕M8 = 0⊕0⊕1⊕0 = 1
C8 = M5⊕M6⊕M7⊕M8 = 1⊕1⊕0⊕0 = 0
  C8 C4 C2 C1
   0  1  1  1
⊕  1  1  0  1
   1  0  1  0
The result 1010 indicates that bit position 10, which contains M6, is in error. After correcting it:
Position: 12 11 10  9  8  7  6  5  4  3  2  1
Bit:      M8 M7 M6 M5 C8 M4 M3 M2 C4 M1 C2 C1
Value:     0  0  0  1  0  1  0  0  1  1  1  1
The data word read from memory should be: 00011001
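For an even-parity Hamming code, the XOR of the positions of all 1 bits in the codeword is 0 for a valid word and otherwise equals the error position. The sketch below reproduces both parts under one consistent reading of the question (1101 taken as the check bits read from memory, placed at positions 8, 4, 2, 1):

```python
def syndrome(bits):
    """bits: dict mapping 1-based position -> bit value."""
    s = 0
    for pos, bit in bits.items():
        if bit:
            s ^= pos          # positions of set bits XOR to the error position
    return s

# (1) check bits for a 1024-bit word: smallest k with 2^k - 1 >= 1024 + k
k = next(k for k in range(1, 32) if 2**k - 1 >= 1024 + k)

# (2) data as read, 00111001, at positions 12, 11, 10, 9, 7, 6, 5, 3,
# plus check bits C8 C4 C2 C1 = 1101 at positions 8, 4, 2, 1
read = {12: 0, 11: 0, 10: 1, 9: 1, 8: 1, 7: 1, 6: 0, 5: 0,
        4: 1, 3: 1, 2: 0, 1: 1}
err = syndrome(read)          # position 10 -> M6 is wrong
read[err] ^= 1                # flip the erroneous bit
data = [read[p] for p in (12, 11, 10, 9, 7, 6, 5, 3)]   # M8..M1
print(k, err, data)           # 11 check bits; corrected word 00011001
```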

6. Consider a cache and a main memory hierarchy, in which cache = 32K words,
main memory = 128M words, cache block size = 8 words, and word size = 4
bytes.
(1) Show physical address format for Direct Mapping (How many bits in Tag,
Block, and Word?)
(2) Show physical address format for 4-way Set Associative Mapping (How
many bits in Tag, Set, and Word?)
(3) Show physical address format for Sector Mapping with 16 blocks per sector.
(How many bits in Sector, Block, and Word?)

Answer:

Memory = 128M words = 512 MB = 2^29 bytes; each block = 8 words × 4 bytes = 32 bytes → 5-bit word (offset) field
(1) Number of blocks = 32K words / 8 words = 4K = 2^12
Tag: 29 − 17 = 12 bits | Block (index): 12 bits | Word (offset): 5 bits
(2) Number of sets = 32K words / (4 × 8 words) = 1K = 2^10
Tag: 29 − 15 = 14 bits | Set (index): 10 bits | Word (offset): 5 bits
(3) With 16 blocks per sector, 4 bits identify the block within a sector
Sector: 29 − 9 = 20 bits | Block: 4 bits | Word (offset): 5 bits
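The three field layouts can be recomputed from the cache parameters (an illustrative helper, not exam material):

```python
import math

ADDR = 29                        # log2(128M words * 4 bytes)
OFFSET = int(math.log2(8 * 4))   # 5 bits: 32-byte blocks

def fields(ways, blocks_per_sector=None):
    """Return (tag/sector bits, index/block bits, offset bits)."""
    blocks = (32 * 1024) // 8                 # 4096 blocks in the cache
    if blocks_per_sector:                     # sector mapping
        block_bits = int(math.log2(blocks_per_sector))
        return ADDR - block_bits - OFFSET, block_bits, OFFSET
    sets = blocks // ways
    index = int(math.log2(sets))
    return ADDR - index - OFFSET, index, OFFSET

print(fields(1))        # direct-mapped: (12, 12, 5)
print(fields(4))        # 4-way:         (14, 10, 5)
print(fields(1, 16))    # sector mapped: (20, 4, 5)
```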

National Sun Yat-sen University CSE, ROC Year 96 (2007)

If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.

1. Choose the most appropriate answer (one only) to each following question.
1.1 Which of the following MIPS addressing mode means that the operand is a
constant within the instruction itself? (a) Register addressing (b) Immediate
addressing (c) Base addressing (d) PC-relative addressing
1.2 Which of the following feature is typical for the RISC machine? (a)
Powerful instructions (b) Large CPI (c) More addressing modes (d) Poor
code density
1.3 Which is the IEEE 754 binary representation for the floating point number
-0.4375ten in single precision? (a) 1 11111110 11100000000000000000000
(b) 1 11111110 11000000000000000000000 (c) 1 01111101
11100000000000000000000 (d) 1 01111101 11000000000000000000000
1.4 A program runs in 10 seconds on computer A (which has a 4 GHz clock)
and 6 second on computer B. If computer B requires 1.2 times as many
clock cycles as computer A for this program. What clock rate does computer
B have? (a) 6 GHz (b) 7 GHz (c) 8 GHz (d) 9 GHz
1.5 Pipelining improves (a) Instruction throughput (b) Individual instruction
execution time (c) Individual instruction latency (d) All of the above are
correct
1.6 Which of the following technique is associated primarily with a
hardware-based approach to exploiting instruction-level parallelism? (a)
Very long instruction word (VLIW) (b) Explicitly parallel instruction
computer (EPIC) (c) Dynamic pipeline scheduling (d) Register renaming
1.7 Consider a cache with 64 blocks and a block size of 16 bytes. What block
number does byte address 1200 map to? (a) 10 (b) 11 (c) 12 (d) 13
1.8 Which of the following statement about "write back" is incorrect? (a) new
value is written only to the block in the cache (b) the modified block is
written to the lower level of the hierarchy when it is replaced (c) more
complex to implement than write-through (d) can ensure that data is always
consistent between cache and memory
1.9 Which of the following statement is incorrect? (a) The compiler must
understand the pipeline to achieve the best performance, (b) Deeper
pipelining usually increases clock rate. (c) Increasing associativity of cache
may slow access time, leading to lower overall performance, (d) The
addition of second level cache can reduce miss rate of the first level cache.
1.10 In a magnetic disk, the disks containing the data are constantly rotating. On
average it should take half a revolution for the desired data on the disk to
spin under the read/write head. Assuming that the disk is rotating at 10,000
revolutions per minute (RPM), what is the average time for the data to rotate
under the disk head? (a) 0.1 ms (b) 0.2 ms (c) 3 ms (d) 6 ms

Answer:
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10
(b) (d) (d) (c) (a) (c) (b) (d) (d) (c)

2. Performance Analysis
2.1 The following measurements have been made on two different computers: M1
and M2.
Program Time on M1 Time on M2
1 2.0 seconds 1.5 seconds
2 5.0 seconds 10.0 seconds

Program  Instructions executed on M1  Instructions executed on M2
1        5 × 10^9                     6 × 10^9
If the clock rates of M1 and M2 are 4 GHz and 6 GHz, respectively, find the
clock cycles per instruction (CPI) for program 1 on both computers.

Answer:
CPI for M1 = (2.0 × 4 × 10^9) / (5 × 10^9) = 1.6
CPI for M2 = (1.5 × 6 × 10^9) / (6 × 10^9) = 1.5

2.2 Assuming the CPI for program 2 on each computer in Problem 2.1 is the same
as the CPI for program 1, find the instruction count for program 2 running on
each computer.

Answer:
Clock cycles = execution time × clock rate, and instruction count = clock cycles / CPI, so:
Instruction Count for M1 = (5.0 × 4 × 10^9) / 1.6 = 12.5 × 10^9
Instruction Count for M2 = (10.0 × 6 × 10^9) / 1.5 = 40 × 10^9

2.3 A compiler designer is trying to decide between two code sequences for a
particular computer. The hardware designers have supplied the following facts:
CPI for this instruction class
A B C
CPI 1 2 3
For a particular high-level-language statement, the compiler writer is
considering two code sequences that require the following instruction counts:
Instruction counts for instruction class
Code sequence
A B C
1 2 1 2
2 4 1 1
Which code sequence executes the most instructions? Which will be faster?
What is the CPI for each sequence?

Answer:
(1) Instruction count for code sequence 1 = 2 + 1 + 2 = 5
Instruction count for code sequence 2 = 4 + 1 + 1 = 6
Hence, code sequence 2 executes the most instructions
(2) Clock cycles for code sequence 1 = 2 × 1 + 1 × 2 + 2 × 3 = 10
Clock cycles for code sequence 2 = 4 × 1 + 1 × 2 + 1 × 3 = 9
Hence, code sequence 2 is faster than code sequence 1
(3) CPI for code sequence 1 = 10 / 5 = 2
CPI for code sequence 2 = 9 / 6 = 1.5
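All three parts of this answer follow from the same two tables, so they can be replayed mechanically (a sketch; the dictionary names are mine):

```python
cpi = {"A": 1, "B": 2, "C": 3}       # cycles per instruction, by class
seq1 = {"A": 2, "B": 1, "C": 2}      # instruction counts, code sequence 1
seq2 = {"A": 4, "B": 1, "C": 1}      # instruction counts, code sequence 2

def totals(counts):
    instructions = sum(counts.values())
    cycles = sum(n * cpi[c] for c, n in counts.items())
    return instructions, cycles, cycles / instructions   # IC, cycles, CPI

print(totals(seq1))  # (5, 10, 2.0)
print(totals(seq2))  # (6, 9, 1.5)
```

Sequence 2 has the larger instruction count yet the fewer total cycles, which is exactly why CPI alone is a poor performance metric.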

2.4 You could speed up a Java program on a new computer by adding hardware
support for garbage collection. Garbage collection currently comprises 20% of
the cycles of the program. You have two possible changes to the machine. The
first one would be to automatically handle garbage collection in hardware. This
causes an increase in cycle time by a factor of 1.2. The second would be to
provide for new hardware instructions to be added to the ISA that could be used
during garbage collection. This would halve the number of instructions needed for
garbage collection but increase the cycle time by a factor of 1.1. Which of these two
options, if either, should you choose? Why?

Answer:
Automatic garbage collection by hardware: The execution time of the new
machine is (1 − 0.2) × 1.2 = 0.96 times that of the original.
Special garbage collection instructions: The execution time of the new
machine is (1 − 0.2/2) × 1.1 = 0.99 times that of the original.
Therefore, the first option is the best choice.
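Both options are instances of the same Amdahl-style formula, which can be sketched as a small helper (the function name is mine; the constants are from the problem):

```python
def relative_time(gc_fraction, gc_speedup, cycle_time_factor):
    # remaining cycles after speeding up the GC portion, times the new cycle time,
    # all relative to the original machine's execution time
    return (1 - gc_fraction + gc_fraction / gc_speedup) * cycle_time_factor

hw_gc = relative_time(0.2, float("inf"), 1.2)  # GC cycles eliminated, 1.2x cycle time
new_isa = relative_time(0.2, 2, 1.1)           # GC cycles halved, 1.1x cycle time
print(hw_gc, new_isa)                          # 0.96 vs ~0.99: option 1 wins
```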

3. Datapath and Control
3.1 Consider the following machines, and compare their performance using the
following instruction frequencies: 25% Loads, 13% Stores, 47% R-type
instructions, and 15% Branch/Jump.
M1: The multicycle datapath shown in Fig. 1 with a 4 GHz clock.
M2: A machine like the multicycle datapath of Fig. 1, except that register updates
are done in the same clock cycle as a memory read or ALU operation. Thus
in Fig. 2 (which shows the complete finite state machine control of Fig. 1),
states 6 and 7 and states 3 and 4 are combined. This machine has a 3.2 GHz
clock, since the register update increases the length of the critical path.
M3: A machine like M2 except that effective address calculations are done in the
same clock cycle as a memory access. Thus states 2, 3, and 4 can be
combined, as can 2 and 5, as well as 6 and 7. This machine has a 2.8 GHz
clock because of the long cycle created by combining address calculation
and memory access.
Find the effective CPI and MIPS (million instructions per second) for all
machines.

Figure 1 (multicycle datapath; diagram not reproduced)

Figure 2

Answer:
Instruction Frequency M1 M2 M3
Loads CPI 25% 5 4 3
Stores CPI 13% 4 4 3
R-type CPI 47% 4 3 3
Branch/jump CPI 15% 3 3 3
Effective CPI 4.1 3.38 3
MIPS 976 946 933
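The effective CPI and MIPS rows follow from weighting each class's cycle count by its frequency; a quick check (a sketch; dictionary names are mine):

```python
freq = {"load": 0.25, "store": 0.13, "rtype": 0.47, "branch": 0.15}
machine_cpi = {
    "M1": {"load": 5, "store": 4, "rtype": 4, "branch": 3},
    "M2": {"load": 4, "store": 4, "rtype": 3, "branch": 3},
    "M3": {"load": 3, "store": 3, "rtype": 3, "branch": 3},
}
clock_mhz = {"M1": 4000, "M2": 3200, "M3": 2800}  # 4, 3.2, 2.8 GHz

results = {}
for m, cpis in machine_cpi.items():
    cpi = sum(freq[k] * cpis[k] for k in freq)
    results[m] = (cpi, clock_mhz[m] / cpi)   # (effective CPI, MIPS)

for m, (cpi, mips) in results.items():
    print(m, round(cpi, 2), round(mips, 1))
```

Note that 3200/3.38 is 946.7, so the table's 946 for M2 comes from truncating rather than rounding.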

3.2 Exception detection is an important aspect of execution handling. Try to identify
the cycle in which the following exceptions can be detected for the multicycle
datapath in Fig. 1. Consider the following exceptions:
a. Overflow exception
b. Invalid instruction
c. External interrupt
d. Invalid instruction memory address
e. Invalid data memory address

Answer:
a b c d e
Detection time cycle 4 cycle 2 any cycle cycle 1 cycle 3

4. Pipelining
4.1 Consider the following code segment in C:
A = B + E;
C = B + F;
Here is the generated MIPS code for this segment, assuming all variables are in
memory and are addressable as offsets from $t0:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Find the hazards in the code segment and reorder the instructions to avoid any
pipeline stalls.
Answer:
Both add instructions have a hazard because of their respective dependence on
the immediately preceding lw instruction. Notice that bypassing eliminates
several other potential hazards including the dependence of the first add on the
first lw and any hazards for store instructions. Moving up the third lw
instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Note: this problem is a textbook example, and the textbook assumes a forwarding unit is present. The problem statement does not say whether forwarding is available, so the answer follows the textbook.

4.2 MIPS instructions classically take five steps (IF, ID, EX, MEM, WB) to execute
in a pipeline. To resolve control hazards, the decision about whether to branch in
the MIPS architecture is moved from the MEM stage to the ID stage. Explain the
advantages and difficulties of moving the branch execution to the ID stage, and
how to overcome the difficulties.

Answer:
Moving the branch execution from the MEM stage to the ID stage means only one
instruction needs to be flushed on a taken branch. Moving the branch decision up
requires two actions to occur earlier:
(1) Compute the branch target earlier: move the branch adder from the EX stage to
the ID stage.
(2) Test register equality earlier: XOR the two registers' respective bits and NOR
all the results.

4.3 Compare the performance for single-cycle, multicycle, and pipelined control by
the average instruction time using the following instruction frequencies (25%
loads, 10% stores, 11% branches, 2% jumps, and 52% ALU instructions) and
functional unit times (200 ps for memory access, 100 ps for ALU operation, and
50 ps for register file read or write). For the multicycle design, the number of
clock cycles for each instruction class is shown in Fig. 2. For the pipelined design,
loads take 1 clock cycle when there is no load-use dependence and 2 when there
is. Branches take 1 when predicted correctly and 2 when not. Jumps always pay 1
full clock cycle of delay, so their average time is 2 clock cycles. Other
instructions take 1 clock cycle. For pipelined execution, assume that half of the
load instructions are immediately followed by an instruction that uses the result
and that one-quarter of the branches are mispredicted. Ignore any other hazards.

Answer:
For single-cycle machine:
CPI = 1
Clock cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps
Average instruction time = 1 × 600 = 600 ps
For multicycle machine:
CPI = 0.25 × 5 + 0.1 × 4 + 0.11 × 3 + 0.02 × 3 + 0.52 × 4 = 4.12
Clock cycle time = 200 ps
Average instruction time = 4.12 × 200 = 824 ps
For pipeline machine:
Effective CPI = 1.5 × 0.25 + 1 × 0.1 + 1 × 0.52 + 1.25 × 0.11 + 2 × 0.02 = 1.17
Clock cycle time = 200 ps
Average instruction time = 1.17 × 200 = 234 ps
The relative performance of the three machines is
Pipeline > single-cycle > multi-cycle
in terms of average instruction time.
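The three averages can be reproduced directly from the stated frequencies and per-class cycle counts (a sketch; variable names are mine):

```python
single = 1 * 600   # ps; every instruction pays the 600 ps worst-case datapath

# multicycle: per-class cycle counts from Fig. 2, 200 ps clock
multi_cpi = 0.25*5 + 0.10*4 + 0.11*3 + 0.02*3 + 0.52*4
multi = multi_cpi * 200

# pipelined: average cycles per class as stated in the problem
pipe_cpi = 0.25*1.5 + 0.10*1 + 0.11*1.25 + 0.02*2 + 0.52*1
pipe = pipe_cpi * 200

print(single, multi, pipe)
```

The pipelined effective CPI is exactly 1.1725, which the answer rounds to 1.17; the unrounded average instruction time is therefore 234.5 ps.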

4.4 Suppose the memory access became 2 clock cycles long. Find the relative
performance of the single-cycle and multicycle designs using the average
instruction time, as described in Problem 4.3.

Answer:
For single-cycle machine:
CPI = 1
Clock cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps
Average instruction time = 1 × 600 = 600 ps
For multicycle machine:
CPI = 0.25 × 7 + 0.1 × 6 + 0.11 × 4 + 0.02 × 4 + 0.52 × 5 = 5.47
Clock cycle time = 100 ps
Average instruction time = 5.47 × 100 = 547 ps

5. Memory Hierarchy
5.1 Suppose we have a processor with a base CPI of 1.0, assuming all references hit
in the primary cache, and a clock rate of 5 GHz. Assume a main memory access
time of 100 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 2%. How much faster will the processor be if
we add a secondary cache that has a 5 ns access time for either a hit or a miss and
is large enough to reduce the miss rate to main memory to 0.6%?

Answer:
The clock cycle time is 1 / 5 GHz = 0.2 ns.
The miss penalty to main memory is 100 ns / 0.2 ns = 500 clock cycles.
For the processor with one level of caching, total CPI = 1.0 + 2% × 500 = 11.0.
The miss penalty for an access to the second-level cache is 5 ns / 0.2 ns = 25 clock
cycles.
For the two-level cache, total CPI = 1.0 + 2% × 25 + 0.6% × 500 = 4.5.
Thus, the processor with the secondary cache is faster by 11.0/4.5 = 2.44
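The same derivation in a few lines of Python (a sketch; variable names are mine, constants from the problem):

```python
cycle_ns = 1 / 5                         # 5 GHz clock -> 0.2 ns per cycle

main_penalty = round(100 / cycle_ns)     # 500 cycles to main memory
l2_penalty = round(5 / cycle_ns)         # 25 cycles to the second-level cache

cpi_l1_only = 1.0 + 0.02 * main_penalty                      # 11.0
cpi_l1_l2 = 1.0 + 0.02 * l2_penalty + 0.006 * main_penalty   # 4.5
speedup = cpi_l1_only / cpi_l1_l2
print(cpi_l1_only, cpi_l1_l2, round(speedup, 2))
```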

5.2 Consider a memory hierarchy using one of the following three organizations for
main memory: (a) one-word-wide memory organization, (b) wide memory
organization, and (c) interleaved memory organization. Assume that the cache
block size is 16 words, that the width of organization (b) is four words, and that
the number of banks in organization (c) is four. If the main memory latency for a
new access is 10 memory bus clock cycles and the transfer time is 1 memory bus
clock cycle. Assume that it takes 1 clock cycle to send the address to the main
memory, what are the miss penalties for each of these organizations?

Answer:
(a) one-word-wide memory organization:
the miss penalty would be 1 + 16 × 10 + 16 × 1 = 177 clock cycles.
(b) wide memory organization (four-word-wide):
the miss penalty would be 1 + 4 × 10 + 4 × 1 = 45 clock cycles.
(c) interleaved memory organization:
the miss penalty would be 1 + 4 × 10 + 16 × 1 = 57 clock cycles.
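The three penalties share one set of constants, which makes the structure of each formula easier to see (a sketch; variable names are mine):

```python
BLOCK = 16       # words per cache block
LATENCY = 10     # memory access latency per round trip (cycles)
TRANSFER = 1     # bus transfer time per word (cycles)
ADDRESS = 1      # cycles to send the address

# (a) one word at a time: 16 accesses, 16 transfers
one_word = ADDRESS + BLOCK * LATENCY + BLOCK * TRANSFER
# (b) four-word-wide memory and bus: 4 accesses, 4 wide transfers
four_wide = ADDRESS + (BLOCK // 4) * LATENCY + (BLOCK // 4) * TRANSFER
# (c) four banks, one-word bus: 4 overlapped accesses, 16 word transfers
interleaved = ADDRESS + (BLOCK // 4) * LATENCY + BLOCK * TRANSFER

print(one_word, four_wide, interleaved)  # 177 45 57
```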

National Sun Yat-sen University CSE, Year 95 (2006)
If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
1. Short Questions:
Answer and explain the following questions.
(1) Given the bit pattern: 1000 1110 1110 1111 0100 0000 0000 0000
What does it represent, assuming that it is
- a two's complement integer?
- a single precision floating-point number?
(2) Explain what "exception" is and how exceptions are handled.
(3) In the simplest implementation for MIPS instruction set, every instruction
begins execution on one clock edge and completes execution on the next
clock edge. Please explain disadvantages of the single-cycle
implementation.
Answer:
(1) - As a two's complement integer: −2^31 + 2^27 + 2^26 + 2^25 + 2^23 + 2^22 + 2^21 + 2^19 + 2^18 + 2^17 + 2^16 + 2^14 = −1,896,923,136
- As a single precision floating-point number: the sign bit is 1 and the exponent field is 00011101 (29), so the value is −1.110111101₂ × 2^(29 − 127) = −1.110111101₂ × 2^−98
(2) Exception: An unscheduled event (from within the CPU) that disrupts
program execution.
Handle: save the address of the offending instruction in the EPC and
transfer control to the operating system at some specified address
(3) Disadvantages: a single-cycle machine is inefficient in both performance
and hardware cost, since:
- The clock cycle of a single-cycle machine equals the worst-case delay over
all instructions, and the penalty for using a fixed clock cycle is significant.
- Each functional unit can be used only once per clock; therefore, some
functional units must be duplicated, raising the cost of hardware.
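Part (1) can be verified by reinterpreting the same 32 bits both ways (a sketch; note that 1.110111101₂ = 1.869140625):

```python
import struct

bits = 0b1000_1110_1110_1111_0100_0000_0000_0000   # the pattern from the question

# two's complement: subtract 2**32 when the sign bit is set
signed = bits - (1 << 32) if bits & (1 << 31) else bits
print(signed)   # -1896923136

# IEEE 754 single precision: reinterpret the same four bytes
value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(value)    # equals -1.869140625 * 2**-98
```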

2. Performance Analysis:
(1) Consider the machine with three instruction classes X, Y, and Z. The
corresponding CPIs for these instruction classes are 1, 2, and 3,
respectively. Now suppose we measure the code for the same program
from two different compilers and obtain the following data:
Code from Instruction counts (in billions) for each instruction class
X Y Z
Compiler 1 5 1 1
Compiler 2 10 1 1
Assume that the machine's clock rate is 500 MHz. Which code sequence will
execute faster according to MIPS and according to execution time?
(2) The table below shows the number of floating-point operations executed in
two different programs and the runtime for those programs on three different
machines:

Program    Floating-point operations  Time on Machine A  Time on Machine B  Time on Machine C
Program 1  100,000,000                1000 s             100 s              40 s
Program 2  1,000,000                  1 s                10 s               40 s
Which machine is fastest according to total execution time?
(3) Assume that equal amounts of time will be spent running each program on
some machine. Which machine is fastest using the data of Table 2 and
assuming a weighting that generates equal execution time for each benchmark
on machine A? Which machine is fastest if we assume a weighting that
generates equal execution time for each benchmark on machine B?
(4) There are two possible improvements: either make multiply instruction run
four times faster than before, or make memory access instructions run two
times faster than before. You repeatedly run a program that takes 100 seconds
to execute. Of this time, 20% is used for multiplication, 40% for memory
access instructions, and 40% for other tasks. What will the speedup be if you
improve only memory access? What will the speedup be if both
improvements are made?

Answer:
(1) Execution Time for compiler 1 = (5 × 1 + 1 × 2 + 1 × 3) × 10^9 / (500 × 10^6) = 20 s
Execution Time for compiler 2 = (10 × 1 + 1 × 2 + 1 × 3) × 10^9 / (500 × 10^6) = 30 s
MIPS for compiler 1 = (5 + 1 + 1) × 10^9 / (20 × 10^6) = 350
MIPS for compiler 2 = (10 + 1 + 1) × 10^9 / (30 × 10^6) = 400
According to execution time, code sequence from compiler 1 is faster.
According to MIPS, code sequence from compiler 2 is faster.
(2) Execution Time for Machine A = 1000 + 1 = 1001
Execution Time for Machine B = 100 + 10 = 110
Execution Time for Machine C = 40 + 40 = 80
Hence, Machine C is fastest.
(3) - Weighting that gives equal execution time on Machine A:
Program      Weight   Machine A  Machine B  Machine C
Program 1    1/1000   1000       100        40
Program 2    1        1          10         40
Weighted AM           2          10.1       40
Hence, Machine A is the fastest.
Note: Weighted AM for Machine A = (0.001/1.001) × 1000 + (1/1.001) × 1 ≈ 2
Weighted AM for Machine B = (0.001/1.001) × 100 + (1/1.001) × 10 ≈ 10.1
Weighted AM for Machine C = (0.001/1.001) × 40 + (1/1.001) × 40 = 40
- Weighting that gives equal execution time on Machine B:
Program      Weight  Machine A  Machine B  Machine C
Program 1    1/10    1000       100        40
Program 2    1       1          10         40
Weighted AM          91.8       18.2       40
Hence, Machine B is the fastest.
Note: Weighted AM for Machine A = (0.1/1.1) × 1000 + (1/1.1) × 1 ≈ 91.8
Weighted AM for Machine B = (0.1/1.1) × 100 + (1/1.1) × 10 ≈ 18.2
Weighted AM for Machine C = (0.1/1.1) × 40 + (1/1.1) × 40 = 40
(4) Speedup by improving memory access = 1 / (0.4/2 + 0.6) = 1.25
Speedup by improving both = 1 / (0.2/4 + 0.4/2 + 0.4) ≈ 1.54
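Parts (3) and (4) are both one-liners once the weighting idea is written down (a sketch; the function and variable names are mine):

```python
times = {"A": (1000, 1), "B": (100, 10), "C": (40, 40)}  # (program 1, program 2) seconds

def weighted_am(ref):
    # weights inversely proportional to the reference machine's times,
    # so both programs contribute equal weighted time on that machine
    t1, t2 = times[ref]
    w1, w2 = 1 / t1, 1 / t2
    total = w1 + w2
    return {m: (w1 * a + w2 * b) / total for m, (a, b) in times.items()}

print(weighted_am("A"))   # A has the lowest weighted mean -> fastest
print(weighted_am("B"))   # B has the lowest weighted mean -> fastest

# part (4): Amdahl's law speedups
speedup_mem = 1 / (0.4 / 2 + 0.6)
speedup_both = 1 / (0.2 / 4 + 0.4 / 2 + 0.4)
print(speedup_mem, speedup_both)
```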

3. Instruction Set:
(1) What is the "addressing mode"? Please explain "displacement addressing" and
"PC-relative addressing".
(2) Memory-memory and load-store are two architectural styles of instruction
sets. We can calculate the instruction bytes fetched and the memory data
bytes transferred using the following assumptions about the two instruction
sets:
• The opcode is always 1 byte (8 bits)
• All memory addresses are 2 bytes (16 bits)
• All data operands are 4 bytes (32 bits)
• All instructions are an integral number of bytes in length
• There are no optimizations to reduce memory traffic
For the following C code, write an equivalent assembly language program
in each architecture style (assume all variables are initially in memory):
a = b + c;
b = a + c;
d = a – b;
For each code sequence, calculate the instruction bytes fetched and the
memory data bytes transferred (read or written). Which architecture is most
efficient as measured by code size? Which architecture is most efficient as
measured by total memory bandwidth required (code + data)?

Answer:
(1) Multiple forms of addressing are generically called
addressing modes.
Displacement addressing: the operand is at the memory location whose
address is the sum of a register and a constant in the instruction
PC-relative addressing: the address is the sum of the PC and a constant in the
instruction
(2) - Memory-memory:
Instructions   Code bytes         Data bytes
add a, b, c    1 + 2 + 2 + 2 = 7  4 + 4 + 4 = 12
add b, a, c    1 + 2 + 2 + 2 = 7  4 + 4 + 4 = 12
sub d, a, b    1 + 2 + 2 + 2 = 7  4 + 4 + 4 = 12
Total          21                 36
Total memory bandwidth = 21 + 36 = 57 bytes
- Load-store: suppose there are 16 registers in the CPU (4-bit register specifiers)
Instructions     Code bytes                Data bytes
load $1, b       1 + 4/8 + 2 = 3.5 → 4     4
load $2, c       1 + 4/8 + 2 = 3.5 → 4     4
add $3, $1, $2   1 + 3 × (4/8) = 2.5 → 3   0
add $4, $3, $2   1 + 3 × (4/8) = 2.5 → 3   0
sub $5, $3, $4   1 + 3 × (4/8) = 2.5 → 3   0
store $3, a      1 + 4/8 + 2 = 3.5 → 4     4
store $4, b      1 + 4/8 + 2 = 3.5 → 4     4
store $5, d      1 + 4/8 + 2 = 3.5 → 4     4
Total            29                        20
Total memory bandwidth = 29 + 20 = 49 bytes
According to code size, the memory-memory architecture is more efficient.
According to total memory bandwidth, the load-store architecture is more efficient.
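The byte counting above can be sketched with the stated encoding assumptions (1-byte opcode, 2-byte addresses, 16 registers; the helper names are mine):

```python
import math

OPCODE_BYTES, ADDR_BYTES, REG_BITS = 1, 2, 4   # 16 registers -> 4-bit specifier

def mem_mem_bytes():
    return OPCODE_BYTES + 3 * ADDR_BYTES       # opcode + three memory addresses

def load_store_bytes(n_regs, n_addrs):
    bits = 8 * OPCODE_BYTES + n_regs * REG_BITS + 8 * ADDR_BYTES * n_addrs
    return math.ceil(bits / 8)                 # instructions are whole bytes

code_mm = 3 * mem_mem_bytes()                  # three 3-operand instructions
data_mm = 3 * 3 * 4                            # each moves three 4-byte operands
code_ls = 5 * load_store_bytes(1, 1) + 3 * load_store_bytes(3, 0)
data_ls = 5 * 4                                # 2 loads + 3 stores, 4 bytes each
print(code_mm, data_mm, code_mm + data_mm)     # 21 36 57
print(code_ls, data_ls, code_ls + data_ls)     # 29 20 49
```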

4. Pipelining
(1) MIPS instructions classically take five steps to execute in pipeline. Please
explain the detailed operations of the five-stage pipeline used in MIPS
instructions.
(2) Explain the three different hazards: data hazards, control hazards, and
structural hazards. Describe the schemes for resolving these hazards.
(3) For each pipeline register in Fig. 1, label each portion of the pipeline register
with the name of the value that is loaded into the register. Determine the
length of each field in bits. For example, the IF/ID pipeline register contains
two fields, one of which is an instruction field that is 32 bits wide.

Fig. 1

Answer:
(1)
1. IF: Instruction fetch
2. ID: Instruction Decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
(2)
Structural hazards: the hardware cannot support the instructions executing in the
same clock cycle.
Data hazards: an attempt to use an item before it is ready.
Control hazards: an attempt to make a decision before the condition is evaluated.

Type: Structural hazard
Solutions: add sufficient hardware (for example, use two memories, one for
instructions and one for data).
Type: Data hazard
Solutions: Software: (1) have the compiler insert enough no-operation (nop)
instructions; (2) reorder the instructions so the data hazard disappears.
Hardware: (1) use forwarding to feed results to subsequent instructions that have a
data hazard; (2) for a load-use data hazard, which forwarding alone cannot resolve,
first stall the pipeline for one clock cycle and then forward.
Type: Control hazard
Solutions: Software: have the compiler insert enough no-operation (nop) instructions.
Hardware: (1) move the branch decision earlier to reduce the number of instructions
that must be flushed when a control hazard occurs; (2) use prediction (static or
dynamic): when the prediction is correct the pipeline runs at full speed and
performance does not suffer from branch instructions; only when the prediction is
incorrect must the incorrectly fetched instructions be flushed.

(3)
IF/ID (64 bits total): PC + 4 (32 bits), Instruction (32 bits)
ID/EX (138 bits total): PC + 4 (32 bits), Register data 1 (32 bits), Register data 2
(32 bits), Sign-extension unit output (32 bits), Register No. rt (5 bits), Register No.
rd (5 bits)
EX/MEM (102 bits total): Branch target address (32 bits), Zero indicator (1 bit),
ALU result (32 bits), Register data 2 (32 bits), Destination register No. (5 bits)
MEM/WB (69 bits total): Memory data (32 bits), ALU result (32 bits), Destination
register No. (5 bits)

5. Memory Hierarchy:
(1) Please explain what memory hierarchy is and why it is necessary.
(2) Cache misses can be sorted into three simple categories: Compulsory,
Capacity, and Conflict. Please explain why they occur and how to reduce
them.
(3) Consider three machines with different cache configurations:
• Cache 1: Direct-mapped with one-word blocks
• Cache 2: Direct-mapped with four-word blocks
• Cache 3: Two-way set associative with four-word blocks
The following miss rate measurements have been made:
• Cache 1: Instruction miss rate is 4%; data miss rate is 8%
• Cache 2: Instruction miss rate is 2%; data miss rate is 6%
• Cache 3: Instruction miss rate is 2%; data miss rate is 4%
For these machines, one-half of the instructions contain a data reference.
Assume that the cache miss penalty is 6 + Block size in words. The CPI for
this workload was measured on a machine with cache 1 and was found to be

2.0. Determine which machine spends the most cycles on cache misses.
(4) The cycle times for the machines in Problem 5.3 are 2 ns for the first and
second machines and 2.5 ns for the third machine. Determine which machine
is the fastest and which is the slowest.

Answer:
(1) Memory hierarchy: A structure that uses multiple levels of memories; as the
distance from the CPU increases, the size of the memories and the access time
both increase.
The reasons for using a memory hierarchy:
- To take advantage of the principle of locality
- To provide the user with as much memory as is available in the cheapest
technology, while providing access at the speed offered by the fastest memory.
(2)
Miss type   Explanation                                                   Solution
Compulsory  first access to a block                                       increase block size
Capacity    cache cannot contain all blocks accessed by the program       increase cache size
Conflict    multiple memory locations mapped to the same cache location   increase associativity

(3) C1 spends the most cycles on cache misses:
Cache  Miss penalty  I-cache miss cycles  D-cache miss cycles  Total miss cycles per instruction
C1     6 + 1 = 7     4% × 7 = 0.28        8% × 7 = 0.56        0.28 + 0.56/2 = 0.56
C2     6 + 4 = 10    2% × 10 = 0.2        6% × 10 = 0.6        0.2 + 0.6/2 = 0.5
C3     6 + 4 = 10    2% × 10 = 0.2        4% × 10 = 0.4        0.2 + 0.4/2 = 0.4
(4)
We need to calculate the base CPI that applies to all three processors. Since
we are given CPI = 2 for C1, CPIbase = CPI – CPImisses = 2 – 0.56 = 1.44
Execution Time for C1 = 2 × 2 ns × IC = 4 × 10^−9 × IC
Execution Time for C2 = (1.44 + 0.5) × 2 ns × IC = 3.88 × 10^−9 × IC
Execution Time for C3 = (1.44 + 0.4) × 2.5 ns × IC = 4.6 × 10^−9 × IC
Therefore C2 is fastest and C3 is slowest.
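The per-cache arithmetic can be replayed in a loop (a sketch; the base CPI of 1.44 is the one derived above, and note that (1.44 + 0.5) × 2 ns works out to 3.88 ns per instruction for C2):

```python
caches = {   # block words, instr miss rate, data miss rate, cycle time (ns)
    "C1": (1, 0.04, 0.08, 2.0),
    "C2": (4, 0.02, 0.06, 2.0),
    "C3": (4, 0.02, 0.04, 2.5),
}
BASE_CPI = 1.44   # measured CPI of 2.0 on C1 minus C1's 0.56 miss cycles

results = {}
for name, (words, im, dm, cycle) in caches.items():
    penalty = 6 + words
    miss_cpi = im * penalty + 0.5 * dm * penalty  # half the instructions reference data
    results[name] = (BASE_CPI + miss_cpi) * cycle # average ns per instruction
print(results)
```

The ordering (C2 fastest, C3 slowest) matches the conclusion above.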

National Sun Yat-sen University CSE, Year 94 (2005)

1. Short Questions: Answer and explain the following questions. Credit will be
given only if explanation is provided.
(1) Explain the problems with using MIPS (million instructions per second) as a
measure for comparing machines.
(2) There are two possible improvements to enhance a machine: either make
multiply instructions run four times faster than before, or make memory
access instructions run two times faster than before. You repeatedly run a
program that takes 100 seconds to execute. Of this time, 10% is used for
multiplication, 50% for memory access instructions, and 40% for other tasks.
What will the speedup be if you improve only memory access? What will the
speedup be if both improvements are made?
(3) Explain and compare microprogrammed control and hardwire control.
(4) There are two basic options when writing to the cache: write through and
write back. Please explain write through and write back, and describe their
advantages.
(5) Explain the differences among superpipelining, superscalar and dynamic
pipeline scheduling.

Answer:
(1)
1. MIPS specifies the instruction execution rate but does not take into
account the capabilities of the instructions (we cannot compare
computers with different instruction sets using MIPS, since the
instruction counts will certainly differ)
2. MIPS varies between programs on the same computer
3. MIPS can vary inversely with performance
(2)
1. Speedup (improve memory) = 100 / (50/2 + 50) = 1.33
2. Speedup (improve both) = 100 / (10/4 + 50/2 + 40) = 1.48
(3)
Microprogrammed control
Method: a method of specifying control that uses microcode rather than a finite
state representation.
Advantages: (1) flexibility: changes can be made late in the design cycle (easy to
change); (2) generality: can implement multiple instruction sets on the same
machine; (3) can implement more powerful instruction sets.
Disadvantages: (1) hard to pipeline; (2) costly to implement.
Hardwired control
Method: an implementation of finite state machine control, typically using
programmable logic arrays (PLAs) or collections of PLAs and random logic.
Advantages: (1) easy to pipeline; (2) less costly to implement.
Disadvantages: (1) hard to design; (2) lack of flexibility (hard to change); (3) lack
of generality (an instruction set only for one machine).

(4)
Write-through
Method: a scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.
Advantage: easy to implement.
Disadvantage: every write to the cache also writes to memory, so CPU writes are
slower.
Write-back
Method: a scheme that handles writes by updating values only to the block in the
cache, then writing the modified block to memory when the block is replaced.
Advantage: a block is written back to memory only when it is replaced, so CPU
writes are faster.
Disadvantage: harder to implement.
(5) Superpipelining: An advanced pipelining technique that increases the depth of
the pipeline to overlap more instructions
Superscalar: An advanced pipelining technique that enables the processor to
execute more than one instruction per clock cycle
Dynamic pipeline scheduling: Hardware support for reordering the order of
instruction execution so as to avoid stalls

2. Performance Analysis:
(1) The PowerPC, made by IBM and Motorola and used in the Apple Macintosh,
shares many similarities to MIPS. The primary difference is two more
addressing modes (indexed addressing and update addressing) plus a few
operations. Please explain the two addressing modes provided by PowerPC.
(2) Consider an architecture that is similar to MIPS except that it supports update
addressing for data transfer instructions. If we run gcc using this architecture,
some percentage of the data transfer instructions shown in Fig. 1 will be able
to make use of the new instructions, and for each instruction changed, one
arithmetic instruction can be eliminated. If 20% of the data transfer
instructions can be changed, which will be faster for gcc, the modified MIPS
architecture or the unmodified architecture? How much faster? (Assume that
the modified architecture has its cycle time increasing by 10% in order to
accommodate the new instructions.)
(3) When designing memory systems, it becomes useful to know the frequency of
memory reads versus writes as well as the frequency of accesses for instructions
versus data. Using the instruction-mix information for MIPS for the program gcc
in Fig. 1, find the following:
(a) The percentage of all memory accesses that are for data (vs. instructions).
(b) The percentage of all memory accesses that are writes (vs. reads).

Answer:
(1) Indexed addressing:
lw $t1, $a0 + $s3 #$t1 = Memory[$a0 + $s3]
Update addressing (update a register as part of load):
lwu $t0,4($s3) #$t0 = Memory[$s3 + 4]; $s3 = $s3 + 4
(2) Execution_unmodified = (1 × 0.48 + 1.4 × 0.35 + 1.7 × 0.15 + 1.2 × 0.02) × IC × T
= 1.249 × IC × T
Execution_modified = [1 × (0.48 − 0.35 × 0.2) + 1.4 × 0.35 + 1.7 × 0.15 + 1.2 × 0.02]
× IC × 1.1T = 1.179 × IC × 1.1T = 1.297 × IC × T
(The modified program executes (0.48 − 0.35 × 0.2) + 0.35 + 0.15 + 0.02 = 0.93
times as many instructions.)
The unmodified architecture is faster than the modified architecture by
1.297/1.249 = 1.038 times.
(3) (a) 0.35/1.35 = 26%
(b) 0.35 × 0.5 / 1.35 = 13%

Note: for (b), the problem does not give enough information, so we assume loads and stores occur equally often among data transfer instructions. lui does not read memory and is ignored.

3. Computer Arithmetic:
(1) Suppose you wanted to add four numbers (A, B, E, F) using 1-bit full adders.
There are two approaches to compute the sum as shown in Fig. 2(a) and 2(b):
cascaded of traditional ripple carry adders and carry save adders. If A, B, E, F
are 4-bit numbers, draw the detailed architecture (consists of 1-bit full adders)
of the carry save addition shown in Fig. 2(b).
(2) Assume that the time delay through each 1-bit full adder is 2T. Calculate and
compare the times of adding four 8-bit numbers using the two different
approaches.
(3) Try Booth's algorithm for the signed multiplication of two numbers: 2 × (−3)
= −6 in decimal (or 0010 × 1101 = 1111 1010 in two's complement). Explain the
operations step by step.

Answer:

(1) (Figures: left, the 4-bit carry save addition; right, the 4-bit cascaded ripple
carry adders. Diagrams not reproduced.)
(2) (a) As shown in the left figure, the CSA critical path must pass through the
rightmost full adders of levels 1 and 2 plus all the full adders of level 3; an 8-bit
CSA has 8 full adders in the third level, so the delay = (2 + 8) × 2T = 20T.
(b) As shown in the right figure, the cascaded ripple carry adder's critical path
must pass through the rightmost full adders of levels 1 and 2 plus all the full
adders of the third level; here the third level has 9 full adders, so the delay =
(2 + 9) × 2T = 22T.
(3)
Iteration  Step                        Multiplicand  Product
0          Initial values              0010          0000 1101 0
1          10 → Prod = Prod − Mcand    0010          1110 1101 0
           Shift right product         0010          1111 0110 1
2          01 → Prod = Prod + Mcand    0010          0001 0110 1
           Shift right product         0010          0000 1011 0
3          10 → Prod = Prod − Mcand    0010          1110 1011 0
           Shift right product         0010          1111 0101 1
4          11 → No operation           0010          1111 0101 1
           Shift right product         0010          1111 1010 1
Dropping the extra bit, the final product 1111 1010 equals −6.
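The iteration table can be cross-checked by simulating the 4-bit Booth recoding directly (a sketch; the register layout and function name are mine):

```python
def booth_multiply(multiplicand, multiplier, bits=4):
    mask = (1 << bits) - 1
    product = multiplier & mask        # multiplier in the low half, extra bit = 0
    extra = 0
    for _ in range(bits):
        pair = (product & 1, extra)    # (current bit, bit to the right)
        if pair == (1, 0):             # 10: subtract multiplicand from the high half
            high = ((product >> bits) - multiplicand) & mask
            product = (high << bits) | (product & mask)
        elif pair == (0, 1):           # 01: add multiplicand to the high half
            high = ((product >> bits) + multiplicand) & mask
            product = (high << bits) | (product & mask)
        extra = product & 1            # arithmetic shift right of (product, extra)
        sign = product >> (2 * bits - 1)
        product = (product >> 1) | (sign << (2 * bits - 1))
    if product >> (2 * bits - 1):      # interpret the result as two's complement
        product -= 1 << (2 * bits)
    return product

print(booth_multiply(0b0010, 0b1101))  # -6, matching the table above
```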

4. Pipelining:
(a) MIPS instructions classically take five steps to execute in pipeline. Please
explain the detailed operations of the five-stage pipeline used in MIPS
instructions.
(b) Fig. 3 shows a pipelined datapath of a MIPS processor. Please explain the
function of the hazard detection unit and the forwarding unit in Fig. 3 and how
they resolve data hazards.
(c) Dynamic branch prediction is usually used to resolve control hazards.
Consider a loop branch that branches nine times in a row, then is not taken
once. What is the prediction accuracy for this branch when applying 1-bit and
2-bit prediction schemes respectively? (1-bit predictor updates the prediction
bit on a mispredict, a prediction in 2-bit predictor must miss twice before it is
changed)

Answer:
(a) 1. IF: Instruction fetch
2. ID: Instruction Decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
(b) A data hazard in the EX stage is detected by checking whether the Rd field of
the EX/MEM pipeline register matches the Rs or Rt field of the ID/EX pipeline
register. If EX/MEM.RegisterRd = ID/EX.RegisterRs, the first multiplexor's select
lines are set to 10 to forward the ALU result; if EX/MEM.RegisterRd =
ID/EX.RegisterRt, the second multiplexor's select lines are set to 10 to forward the
ALU result.
A data hazard in the MEM stage is detected by checking whether the Rd field of
the MEM/WB pipeline register matches the Rs or Rt field of the ID/EX pipeline
register. If MEM/WB.RegisterRd = ID/EX.RegisterRs, the first multiplexor's
select lines are set to 01 to forward the ALU result or the memory data; if
MEM/WB.RegisterRd = ID/EX.RegisterRt, the second multiplexor's select lines
are set to 01 to forward the ALU result or the memory data.
The hazard detection unit checks whether a load-use data hazard exists between
two consecutive instructions; if so, it stalls the pipeline for one clock cycle.
(c) 1-bit prediction: the first and the last executions are mispredicted, so the
accuracy is 80%.
2-bit prediction: only the last execution is mispredicted, so the accuracy is 90%.
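The two accuracies in (c) can be confirmed by simulating both predictors over one pass of the loop (a sketch; the 1-bit predictor starts predicting not-taken, since the previous loop exit left it in that state, and the 2-bit counter starts strongly taken):

```python
outcomes = [True] * 9 + [False]   # taken nine times in a row, then not taken

def one_bit(outcomes, state=False):
    correct = 0
    for taken in outcomes:
        correct += (state == taken)
        state = taken                            # flips on every mispredict
    return correct / len(outcomes)

def two_bit(outcomes, counter=3):                # 3 = strongly taken
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)     # must miss twice to change
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

print(one_bit(outcomes), two_bit(outcomes))      # 0.8 0.9
```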

5. Cache:
(1) How does the control unit deal with cache misses? Please describe the steps to
be taken on an instruction cache miss as clear as possible.
(2) Please explain the function of three portions (Tag, Index, and Block offset) in
the address of Fig. 4. How many total bits are required for the cache?
(3) Assume an instruction cache miss rate for gcc of 2% and a data cache miss
rate of 4%. If a machine has a CPI of 2 without any memory stalls and the
miss penalty is 50 cycles for all misses, determine how much faster a machine
would run with a perfect cache that never missed. Use the instruction
frequencies for gcc from Fig. 1.
(4) Suppose we increase the performance of the machine in the previous question
by doubling its clock rate. Since the main memory speed is unlikely to change,
assume that absolute time to handle a cache miss does not change. Assuming
the same miss rate as the previous question, how much faster will the machine
be with the faster clock?

Answer:

(1) The control unit deal with cache misses as follows:
1. Send the original PC value (current PC–4) to the memory
2. Instruct main memory to perform a read and wait for the memory to
complete its access
3. Write the cache entry, putting the data from memory in the data portion of
the entry, writing the upper bits of the address into the tag field, and
turning the valid bit on
4. Restart the instruction execution at the first step, which will refetch the
instruction, this time finding it in the cache.
(2) Tag: contains the address information required to identify whether the
associated block in the hierarchy corresponds to a requested word.
Index: is used to select the block.
Block offset: specify a word within a block.
Total bits = 2^12 × (1 + 16 + 4 × 32) = 580 Kbits
(3) CPI for non-perfect cache = 2 + 0.02 × 50 + 0.04 × 0.35 × 50 = 3.7
Hence, the machine with a perfect cache is faster than the one with the
non-perfect cache by 3.7/2 = 1.85 times.
(4) The miss penalty becomes 100 clock cycles.
New CPI = 2 + 0.02 × 100 + 0.04 × 0.35 × 100 = 5.4
Hence the machine with the faster clock is 3.7 / (5.4 / 2) = 1.37 times faster.

Instruction class   MIPS example         Average CPI  Frequency (gcc)  Frequency (spice)
Arithmetic          add, sub, addi       1.0          48%              50%
Data transfer       lw, sw, lb, sb, lui  1.4          35%              41%
Conditional branch  beq, bne, slt, slti  1.7          15%              7%
Jump                j, jr, jal           1.2          2%               2%
Fig. 1

Fig. 2(a) Fig. 2(b)

Fig. 3 (pipelined MIPS datapath with the hazard detection unit and forwarding unit; diagram not reproduced)

[Fig. 4: direct-mapped cache with 4K entries and four-word (128-bit) blocks; the 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset, and a 2-bit byte offset; a 4-to-1 multiplexor selects one 32-bit word from the block, and the tag comparison produces the hit signal (diagram not reproduced)]

93 年中山資工

If some questions are unclear or not well defined to you, you can make your own assumptions and state them clearly on the answer sheet.
I. 選擇題: Please choose the most appropriate answer (one only) to each following
question.
1. Which of the following metrics will not benefit from the forwarding technique?
(a) program instruction count (b) execution time (c) data hazard stall (d)
effective CPI.
2. Considering a direct-mapped cache with 64 blocks and a block size of 16 bytes, what block number does memory address 256 (decimal) map to? (a) 4 (b) 16 (c) 64 (d) 256
3. Which of the following style of instruction sets is most likely to achieve
smallest instruction count while compiling a program? (a) accumulator (b)
load-store (c) stack (d) memory-memory
4. How many bits of ROM are required to implement a Moore finite-state machine with 8 states, 2 inputs, and 3 output control signals? (a) 48 (b) 60 (c) 96 (d) 192
5. To get a speed up of 5 from 20 processors, which number is closest to the
minimum percentage of the original program that has to be sequential? (a) 5%
(b) 10% (c) 16% (d) 80%
6. What is the main advantage of adding some complex instructions to an existing instruction set? (a) reduced CPI (b) less instruction count (c) faster clock cycle (d) increased MIPS.
7. Which of the following RISC addressing modes involves a memory operation? (a) Base addressing (b) Register addressing (c) PC-relative addressing (d) Immediate addressing
8. Which of the following floating point numbers represented in the IEEE 754 standard (1 bit for sign, 8 bits for exponent) is the largest? (a) 0 11111111
10000000000000000000000 (b) 0 01000000 10000000000000000000000 (c) 1
11000000 10000000000000000000000 (d) 0 10000000
00000000000000000000000
9. Which of the following statements about computer arithmetic is correct? (a) The basic Booth's algorithm can always improve the multiplication speed (b) The addition of two floating point numbers won't overflow (c) The subtraction of two two's complement numbers won't overflow (d) Floating-point addition is not associative.
10. What is not an advantage of adding a second-level cache? (a) reduced
miss penalty (b) reduced miss rate of the first level cache (c) reduced effective
CPI (d) reduced program‘s execution time
11. The compiler technique can help improve many metrics except (a) average
CPI (b) clock frequency (c) miss rate (d) control hazard
12. Which of the following features is not typical of a RISC machine? (a)
powerful instruction set (b) small CPI (c) limited addressing mode (d) simple

hardware architecture
13. Which of the following statements about the use of a larger block size is correct? (a) It can always reduce the miss rate. (b) It can reduce the miss rate because of the temporal locality of the program. (c) It can reduce compulsory misses. (d) It can reduce the miss penalty.
14. By increasing the pipeline depth of the machine, (a) the execution time can
always be reduced (b) the chance of hazard becomes smaller (c) the average
CPI may increase (d) None of the above are correct.
15. Which of the following statements about the single-cycle, multicycle, and pipelined implementations of the MIPS machine is correct? (a) The single-cycle machine requires the least amount of hardware (b) The pipelined machine has the smallest effective CPI (c) The multicycle machine requires the least amount of hardware (d) The clock frequency of the multicycle machine is the slowest.

Answer:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(a) (b) (d) (d) (c) (b) (a) (d) (d) (b) (b) (a) (c) (c) (c)

Note (4): each ROM word must supply 3 control signals and 3 next-state bits, and the ROM is addressed by the 2 inputs plus 3 state bits, so ROM size = 2^(2+3) × (3 + 3) = 192 bits.
Note (5): with x as the sequential fraction, 1/(x + (1 - x)/20) = 5, so x = 3/19 ≈ 0.158; about 16% of the program must be sequential.
Note (8): choice (a) has an all-ones exponent field, so it does not represent a (finite) number.
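The arithmetic in Notes (4) and (5) can be double-checked with a quick Python sketch:

```python
# Note 4: each ROM word holds 3 control signals + 3 next-state bits,
# addressed by 2 inputs + 3 state bits (8 states).
rom_bits = 2 ** (2 + 3) * (3 + 3)

# Note 5: Amdahl's law with x = sequential fraction and 20 processors.
x = 3 / 19
speedup = 1 / (x + (1 - x) / 20)

print(rom_bits, round(x, 4), speedup)   # 192, ~0.1579, 5.0
```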

II. Computer Datapath


Fig. 1 shows the complete datapath of the multi-cycle implementation of MIPS
processor.
1. Describe the function of the following segment of the program. ($zero is one
MIPS register and always 0. Register $a1 equals 10 initially.)
add $t0, $zero, $zero
loop1: add $t1, $t0, $t0
add $t1, $t1, $t1
add $t2, $a0, $t1
sw $zero, 0($t2)
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1
2. Evaluate the CPI of the individual instruction in the above program, and also
calculate the overall CPI for running the entire program.
3. Let's suppose an overflow in an ALU operation leads to an exception. When this exception occurs, the machine will jump to a special interrupt service routine at 0x000C, and the old PC will be written to a specific register $k1.

Discuss how to modify the datapath to support this exception's behavior.
4. If we pipeline the datapath shown in Fig. 1 to achieve a five-stage pipelined
MIPS, some pipelining hazard may occur while running the above program.
Find out what hazards will occur and how many cycles the machine will stall
related to the execution of the instruction bne $t3, $zero, loop1. (Assume no
data forwarding mechanism is used, and the register read/write can happen in
the same cycle.)
5. One way to cope with the hazard is to use a branch delay slot. Explain this
concept and illustrate how to apply it to the program shown as above.

Answer:
1. The code clears 10 (word-sized) elements of an array whose starting address is contained in register $a0.
2.
Instruction add sw addi slt bne Overall
CPI 4 4 4 4 3 3.86
(Note: instruction count = 1 + 10 × 7 = 71; total CPU cycles = 4 + 10 × (4 + 4 + 4 + 4 + 4 + 4 + 3) = 274; overall CPI = 274/71 ≈ 3.86)
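The instruction-count and CPI arithmetic can be verified with a short Python sketch (per-instruction cycle counts are those in the table above):

```python
# One add before the loop, then 10 iterations of 7 instructions each.
instr_count = 1 + 10 * 7
# Per iteration: add, add, add, sw, addi, slt take 4 cycles each; bne takes 3.
total_cycles = 4 + 10 * (4 + 4 + 4 + 4 + 4 + 4 + 3)
overall_cpi = total_cycles / instr_count
print(instr_count, total_cycles, round(overall_cpi, 2))   # 71 274 3.86
```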
3. [Datapath modification (diagram not reproduced): add the constant 0x000C as an extra input to the PC multiplexor, and add a path that writes the old PC into register $k1 when the ALU overflow exception is raised.]
4. (a) Data hazard, control hazard, and structural hazard.


(b) Two stall cycles are needed between slt and bne to resolve the data hazard, and one stall cycle is needed after bne to resolve the control hazard.
5. The delayed branch allows one or more instructions following the branch to
be executed in the pipeline whether the branch is taken or not. The instruction

following a branch or jump is called the delay slot. Compilers and assemblers
try to place a safe instruction, which is not affected by the branch, in the
branch delay slot.
We can move the sw instruction into the branch delay slot after the bne instruction, as shown below.
add $t0, $zero, $zero
loop1: add $t1, $t0, $t0
add $t1, $t1, $t1
add $t2, $a0, $t1
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1
sw $zero, 0($t2)

III. Cache
Suppose the gcc program is run on a 200 MHz RISC machine with a two-way set-associative unified cache of 64 KB and four-word blocks. The miss rates of instruction and data memory accesses are 2% and 6%, respectively. The CPI of the
machine is 2 when there is no memory stall. The miss penalty is 40 cycles for all
misses. The instruction mix of the gcc program is 22% load, 8% store, 50%
R-type, 18% branch and 2% jump instructions. The instruction length of this
processor is 32-bit wide.
1. Draw the cache organization and calculate the overall size of the cache
including the tag and the valid bit.
2. Calculate the MIPS of this machine for running the gcc program.
3. In some cache organization, some dirty bit will be used. Discuss the purpose
of the use of dirty bit.
4. Virtual memory can be regarded as another level of the memory hierarchy.
Compare the virtual memory and data cache from the following aspects:
block placement scheme, block replacement strategies and write policy.

Answer:
1. A four-word block ⇒ 4 bits of offset; the number of blocks = 64 KB / 16 B = 4K; the number of sets = 4K/2 = 2K. Hence, the index field has 11 bits and the tag field = 32 - 11 - 4 = 17 bits.
The total cache size = 2K × 2 × (1 + 17 + 128) = 584 Kbits
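The field widths and total size can be derived programmatically (a sketch, assuming the 32-bit address stated in the problem):

```python
cache_bytes = 64 * 1024
block_bytes = 16                               # four 4-byte words
ways = 2
sets = cache_bytes // block_bytes // ways      # 2048 sets
index_bits = sets.bit_length() - 1             # 11
offset_bits = 4
tag_bits = 32 - index_bits - offset_bits       # 17
total_bits = sets * ways * (1 + tag_bits + block_bytes * 8)
print(sets, index_bits, tag_bits, total_bits // 1024)   # 2048 11 17 584
```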

[Cache organization diagram: the 32-bit address splits into a 17-bit tag (bits 31-15), an 11-bit index (bits 14-4), and a 4-bit offset; 2048 sets (index 0 to 2047), each holding two (V, Tag, 128-bit Data) entries; a 2-to-1 MUX driven by the per-way hit signals selects the data output (diagram not reproduced)]

2. Average CPI = 2 + 0.02 × 40 + 0.06 × (0.22 + 0.08) × 40 = 3.52
MIPS = (200 × 10^6) / (3.52 × 10^6) = 56.82
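The MIPS figure follows directly from the given mix and miss rates, as a two-line check shows:

```python
cpi = 2 + 0.02 * 40 + 0.06 * (0.22 + 0.08) * 40   # instruction + data stall cycles
mips = 200e6 / (cpi * 1e6)
print(cpi, round(mips, 2))   # ~3.52 and 56.82
```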
3. In a cache that uses write-back, a write hit sets the cache block's dirty bit to 1. When that block is later chosen for replacement, a dirty bit of 1 means the block must first be written back to memory before it is replaced.
4. The page-fault penalty of virtual memory is extremely high, so minimizing the miss rate is the primary design consideration. Virtual memory therefore uses fully associative block placement, LRU replacement, and a write-back write policy.
A cache's miss penalty is not nearly as high, so its block placement may be direct-mapped, set-associative, or fully associative; block replacement may use LRU or random replacement; and the write policy may be either write-back or write-through.

IV. Computer arithmetic
1. Draw the gate-level implementation of a full adder.
2. Draw the detailed architecture of a simple ALU, as shown in Fig. 2, capable of performing 4-bit two's complement addition and subtraction. When op equals 1 it performs (A + B); when op equals 0 it performs (A - B). This ALU sets the flag f_over (= 1) when overflow occurs. The other flag, f_comp, is set when the two numbers are equal (i.e., A = B).

Answer:
1. [Gate-level full-adder circuit (diagram not reproduced)]
2. [4-bit ALU diagram: four chained full adders take inputs a0-a3 and b0-b3 and produce s0-s3; the op input selects addition or subtraction; f_over is derived from the high-order carries (c4); f_comp is asserted when A = B (diagram not reproduced)]

[Fig. 1: complete datapath of the multicycle MIPS implementation (diagram not reproduced)]
[Fig. 2: ALU block with 4-bit inputs A = (a3, a2, a1, a0) and B = (b3, b2, b1, b0), control input op, 4-bit output S = (s3, s2, s1, s0), and output flags f_over and f_comp (diagram not reproduced)]

92 年中山資工

1. Answer and explain the following questions. Credit will be given only if
explanation is provided.
(1) What is an "addressing mode"? Enumerate and explain two addressing modes that are frequently used in reduced instruction set computers (RISCs).
(2) RISCs generally have poor code density (larger code size) compared with CISCs (complex instruction set computers). Please explain how RISCs can reduce code size.
(3) Pentium 4 has a much deeper pipeline than Pentium III. Please explain the
advantage and disadvantage of deeper pipeline.
(4) Is the floating-point addition performed in a computer associative? Why?
(5) Derive the IEEE 754 binary representation for the floating-point
number –10.75ten in single precision
(6) Assume that multiply instructions take 10 cycles and account for 20% of the
instructions in a typical program and that the other 80% of the instructions
require an average of 5 cycles for each instruction. What percentage of time
does the CPU spend doing multiplication?
(7) Assume that in 1000 memory references there are 30 misses in the first-level
(LI) cache and 6 misses in the second-level (L2) cache. The miss penalty from
the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10
clock cycles, the hit time of LI is 1 clock cycle, and there are 1.5 memory
references per instruction. What is the average memory access time? Ignore
the impact of writes.

Answer:
(1) 1. Register addressing, where the operand is a register:
add $s0, $s1, $s2 # $s0 ← $s1 + $s2
2. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction:
lw $s0, 20($s1) # $s0 ← Mem[20 + $s1]
(2) Compiler optimizations to reduce code size: such as strength reduction, dead code elimination, and common subexpression elimination.
Hardware techniques to reduce code size: such as dictionary compression
where identical code sequences are identified and each occurrence is assigned
a variable-length codeword based on the frequency of occurrence. (more
frequently occurring instruction sequences are assigned shorter codewords)
(3) Extending the length of a pipeline can have major benefits to increase speed,
but at the same time having a deeper pipeline will need more complex
circuitry to prevent data hazards and if such hazards cannot be prevented the
overall performance of the CPU will be less.

(4) No.
For example, suppose x = -1.5ten × 10^38, y = 1.5ten × 10^38, and z = 1.0, and that these are all single precision numbers.
Then x + (y + z) = -1.5ten × 10^38 + (1.5ten × 10^38 + 1.0) = -1.5ten × 10^38 + 1.5ten × 10^38 = 0.0, whereas (x + y) + z = (-1.5ten × 10^38 + 1.5ten × 10^38) + 1.0 = 0.0 + 1.0 = 1.0.
Therefore, x + (y + z) ≠ (x + y) + z.
(5) -10.75ten = -1010.11two = -1.01011two × 2^3
IEEE 754 single precision format = 1 10000010 01011000000000000000000
(6) (10 × 0.2) / (10 × 0.2 + 5 × 0.8) ≈ 0.3333 = 33.33%
(7) Average memory access time = 1 + (30/1000) × 10 + (6/1000) × 100 = 1.9 clock cycles per memory reference. (The 1.5 memory references per instruction are needed only to convert this into memory-access cycles per instruction: 1.5 × 1.9 = 2.85.)
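Under the standard definitions (AMAT is per memory reference; the 1.5 references per instruction only convert it to a per-instruction figure), the numbers work out as follows:

```python
# Misses are given per 1000 memory references.
l1_miss = 30 / 1000
l2_miss = 6 / 1000                           # misses that go all the way to memory
amat = 1 + l1_miss * 10 + l2_miss * 100      # clock cycles per memory reference
mem_cycles_per_instr = 1.5 * amat            # using 1.5 references per instruction
print(amat, mem_cycles_per_instr)            # ~1.9 and ~2.85
```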

2. (1) There are three methods to implement the datapath: single-cycle (M1),
multicycle (M2), and pipeline (M3). The operation times for the major
functional units in these implementations are 2 ns for memory access, 2 ns for
ALU operation, and 1 ns for register file read or write. Assuming that the
multiplexors, control unit, PC accesses, sign extension unit, and wires have no
delay. The other details of these three implementations are listed as follows:
M1: The critical path of single-cycle implementation for the different
instruction classes is:
Instruction class | Functional units used by the instruction class
ALU type          | Instruction fetch → Register access → ALU → Register access
Load word         | Instruction fetch → Register access → ALU → Memory access → Register access
Store word        | Instruction fetch → Register access → ALU → Memory access
Branch            | Instruction fetch → Register access → ALU
Jump              | Instruction fetch
M2: Multicycle implementation uses the control shown in <Fig. 1>.
M3: For the pipelined implementation, assume that half of the load instructions
are immediately followed by an instruction that uses the result, that the
branch delay on misprediction is 1 clock cycle, and that one-quarter of the
branches are mispredicted. Assume that jumps always pay 1 full clock
cycle of delay, so their average time is 2 clock cycles.
If the instruction mix is 23% loads, 13% stores, 43% ALU instructions, 19%
branches, and 2% jumps. Please calculate the average CPI (clock cycles per
instruction) and the average instruction time for the three implementations.
(2) Consider the five instructions in the following program. These instructions
execute on the five-stage pipelined datapath of <Fig. 2>. The five stages are:
instruction fetch (IF), instruction decode and fetch operand (ID), instruction
execution (EX), memory access (MEM), and register write back (WB).
Assume that each stage takes one cycle to complete its execution and the first
instruction starts from clock cycle 1.

add $1, $2, $3
add $4, $5, $1
lw $6, 50($7)
sub $8, $6, $9
add $10, $11, $8
(a) At the end of the fifth cycle of execution, which registers are being read
and which register will be written? How many cycles will it take to execute
this program?
(b) Explain what the forwarding unit and the hazard detection unit are doing
during the fifth cycle of execution. If any comparisons are being made,
mention them.

Answer:
(1)
CPI for M1 = 1
CPI for M2 = 5 × 0.23 + 4 × 0.13 + 4 × 0.43 + 3 × 0.19 + 3 × 0.02 = 4.02
CPI for M3 = 1 + 0.23 × 0.5 × 1 + 0.19 × 0.25 × 1 + 0.02 × 1 =1.1825
Average instruction time for: M1 = 1 × 8 ns = 8 ns
M2 = 4.02 × 2 ns = 8.04 ns
M3 = 1.1825 × 2 ns = 2.365 ns
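These CPI and instruction-time figures can be checked with a quick Python sketch using the given instruction mix:

```python
mix = {"load": 0.23, "store": 0.13, "alu": 0.43, "branch": 0.19, "jump": 0.02}

cpi_m1 = 1.0
cycles_m2 = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}
cpi_m2 = sum(mix[k] * cycles_m2[k] for k in mix)
# M3: base CPI 1 plus load-use stalls, mispredicted branches, and jump delay.
cpi_m3 = 1 + mix["load"] * 0.5 * 1 + mix["branch"] * 0.25 * 1 + mix["jump"] * 1

t_m1 = cpi_m1 * 8   # single-cycle clock = longest path (load word) = 8 ns
t_m2 = cpi_m2 * 2   # multicycle/pipeline clock = slowest unit = 2 ns
t_m3 = cpi_m3 * 2
print(cpi_m2, cpi_m3, t_m1, t_m2, t_m3)   # ~4.02, ~1.1825, 8 ns, ~8.04 ns, ~2.365 ns
```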
(2)
(a) Register $1 is being written and Registers $6 and $9 are being read.
Since there is load-use data hazard between instructions 3 and 4, one
clock stall is required; therefore, execute this program needs (5 – 1) + 5 +
1 = 10 clock cycles
(b) The forwarding unit is comparing $1 = $7? $4 = $7? $1 = $6? $4 = $6?
The hazard detection unit is comparing $6 = $6? $6 = $9?

3. (1) Suppose we have made the following measurements:


Frequency of floating-point (FP) instructions = 5%
Frequency of ALU instructions = 25%
Average CPI of FP instructions = 20.0
Average CPI of ALU instructions = 4.0
Average CPI of other instructions = 2.0
Assume that the two design alternatives are to decrease the average CPI of FP
instructions to 8.0 or to decrease the average CPI of ALU instructions to 2.0.
Compare the performance of these two design alternatives.
(2) Consider the following three processors with the same instruction architecture:
(a) A simple processor running at a clock rate of 1.2 GHz and achieving a
pipeline CPI of 1.0. This processor has a cache system that yields 0.01
misses per instruction.
(b) A deeply pipelined processor with slightly smaller caches and a 1.5 GHz
clock rate. The pipeline CPI of the processor is 1.2, and the smaller caches

yield 0.014 misses per instruction on average.
(c) A speculative superscalar processor. This processor has the smallest caches
and a 1 GHz clock rate. The pipeline CPI of the processor is 0.4, and the
smallest caches lead to 0.02 misses per instruction, but it hides 20% of the
miss penalty on every miss by dynamic scheduling.
Assume that the main memory time (which sets the miss penalty) is 100 ns.
Determine the relative performance of these three processors.

Answer:
(1) CPI for Design 1 = 0.05 × 8 + 0.25 × 4 + 0.7 × 2 = 2.8
CPI for Design 2 = 0.05 × 20 + 0.25 × 2 + 0.7 × 2 = 2.9
The performance of Design 1 is 2.9 / 2.8 = 1.0357 times better than
Design 2. Hence, decreasing the average CPI of FP is better than
decreasing the average CPI of ALU.
(2) (a) machine: cycle time = 1/1.2GHz = 0.83 ns
Average instruction time = (1 + 0.01 × 100/0.83) × 0.83 = 1.83 ns
(b) machine: cycle time = 1/1.5GHz = 0.67 ns
Average instruction time = (1.2 + 0.014 × 100/0.67) × 0.67 = 2.2 ns
(c) machine: cycle time = 1/1.0GHz = 1 ns
Average instruction time = (0.4 + 0.02 × 0.8 × 100) × 1 = 2 ns
The performance relationship of machines (a) (b) (c) is (a) > (c) > (b).
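The three average instruction times can be reproduced with a small Python sketch (the helper name avg_instr_time is illustrative):

```python
MEM_NS = 100.0   # main memory time, which sets the miss penalty

def avg_instr_time(clock_ghz, pipeline_cpi, misses_per_instr, hidden=0.0):
    cycle_ns = 1 / clock_ghz
    # miss penalty in cycles, reduced by any fraction hidden via dynamic scheduling
    miss_cycles = misses_per_instr * (MEM_NS / cycle_ns) * (1 - hidden)
    return (pipeline_cpi + miss_cycles) * cycle_ns

t_a = avg_instr_time(1.2, 1.0, 0.01)
t_b = avg_instr_time(1.5, 1.2, 0.014)
t_c = avg_instr_time(1.0, 0.4, 0.02, hidden=0.2)
print(round(t_a, 2), round(t_b, 2), round(t_c, 2))   # 1.83 2.2 2.0
```

A smaller average instruction time means higher performance, so the ordering (a) > (c) > (b) follows.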

4. (1) Suppose the branch frequencies (as percentages of all instructions) are as follows:
Conditional branches 20%
Unconditional branches 1%
60% of conditional branches are taken
Consider a five-stage pipelined machine where the branch is resolved at the
end of the third cycle for conditional branches and at the end of the second
cycle for unconditional branches. Assuming that only the first pipe stage can
always be done independent of whether the branch goes and ignoring other
pipeline stalls, how much faster would the machine be without any branch
hazards?
(2) A superscalar MIPS processor can issue two instructions per clock cycle. One
of the instructions could be an integer ALU operation or branch, and the other
could be a load or store.
(a) Enumerate the possible extra resources required for extending a simple MIPS pipeline into the superscalar pipeline so that it wouldn't be hindered by structural hazards.
(b) Unroll the following loop once under the assumption that the loop index is
a multiple of two. How would this unrolled loop be scheduled on the
superscalar pipeline for MIPS? Reorder the instructions to avoid as many
pipeline stalls as possible and show your unrolled and scheduled code as
the following table.

Loop: lw $t0, 0($s1)
      addu $t0, $t0, $s2
      sw $t0, 0($s1)
      addi $s1, $s1, -4
      bne $s1, $zero, Loop

ALU or branch | Data transfer | Clock cycle
Loop:         |               | 1
              |               | 2
              |               | 3
              |               | ...

Answer:
(1) The penalty for conditional branch hazard = 2 clock cycles
The penalty for unconditional branch hazard = 1 clock cycle
The average CPI considering branch hazard = 1 + 0.2 × 0.6 × 2 + 0.01 × 1 =
1.25
The machine without branch hazard is 1.25 times faster than the machine with
branch hazard.
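A one-line check of the CPI arithmetic:

```python
cond, uncond, taken = 0.20, 0.01, 0.6
cpi = 1 + cond * taken * 2 + uncond * 1   # taken-conditional penalty 2, unconditional 1
print(cpi)   # ~1.25, so the hazard-free machine is 1.25x faster
```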
(2) (a) Extra hardware includes:
Register file: 2 read ports for the ALU operation, 2 read ports for the store, one write port for the ALU result, and one write port for the load
A separate adder for the address calculation of data transfers
(b)
Loop: lw $t0, 8($s1)
addu $t0, $t0, $s2
sw $t0, 8($s1)
lw $t1, 4($s1)
addu $t1, $t1, $s2
sw $t1, 4($s1)
addi $s1, $s1, -8
bne $s1, $zero, Loop

ALU or branch instruction Data transfer instruction Clock cycle


Loop: addi $s1, $s1, -8 lw $t0, 0($s1) 1
lw $t1, 4($s1) 2
addu $t0, $t0, $s2 3
addu $t1, $t1, $s2 sw $t0, 8($s1) 4
bne $s1, $zero, Loop sw $t1, 4($s1) 5

[Fig. 1: multicycle datapath control (diagram not reproduced)]
[Fig. 2: five-stage pipelined datapath with a hazard detection unit (monitoring ID/EX.MemRead and controlling PCWrite and IF/IDWrite) and a forwarding unit (comparing EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs/Rt register fields) (diagram not reproduced)]

96 年政大資科

1. [ADC] The successive approximation converter is one of the most widely used
types of ADC. Given the following simplified block diagram:
(a) Explain how the control logic works using a flowchart.
(b) Assume VA = 10.4V, use a simple four-bit converter with a step size of 1V to
illustrate the conversion process.

Answer:
(a) The control logic performs a successive approximation, which the flowchart expresses as the following steps:
1. Start: clear all bits of the register.
2. Beginning at the MSB, set the current bit to 1.
3. If the DAC output exceeds VA, clear the bit back to 0; otherwise leave it set.
4. Move to the next lower bit and repeat steps 2 and 3 until the LSB has been decided.
5. Conversion finished: the converted value is the number left in the register.

(b) 10.4 > 8 ⇒ b3 = 1
10.4 - 8 = 2.4 < 4 ⇒ b2 = 0
2.4 > 2 ⇒ b1 = 1
2.4 - 2 = 0.4 < 1 ⇒ b0 = 0
The converted result is 1010 (decimal 10).
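The conversion can be simulated with a short Python sketch (the function name sar_adc and the step-size parameter are illustrative):

```python
def sar_adc(v_in, bits=4, step=1.0):
    """Successive approximation: decide the bits from MSB to LSB."""
    code = 0
    for i in reversed(range(bits)):
        trial = code | (1 << i)        # tentatively set the current bit
        if trial * step <= v_in:       # keep it unless the DAC output exceeds v_in
            code = trial
    return code

print(format(sar_adc(10.4), "04b"))    # 1010
```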

2. [Instruction Set Architecture, Cache, Performance]
(a) One of the differences between RISC architectures and CISC architecture is
supposed to be the reduced types of instructions available. A student thinks it
would be a good idea to simplify the instruction set even more to remove the
special case instructions that take immediate operands such as "li", "addi", etc.
Explain to him/her why this might not be such a good idea.
(b) Explain how a memory system that pages to secondary storage depends on
locality of reference for efficient operation.
(c) Program A consists of 2000 consecutive add instructions, while program B
consists of a loop that executes a single add instruction 2000 times. You run
both programs on a certain machine and find that program B consistently
executes faster. Explain.

Answer:
(a) Common operations would now require multiple instructions, without any corresponding improvement in cycle time.
(b) Without locality, the memory system would run at disk speed, since every access could require a disk access. A working set that fits within physical memory is required for efficient operation.
(c) Program B fits easily in the instruction cache, but program A's 2000 distinct instructions must each be fetched, so it takes more time.

3. [Adder] A half-adder takes two input bits A and B to produce sum (S) and carry
(Cout) outputs.
(a) Use basic logic gates to construct the circuits for a half adder.
(b) Use exactly two half-adders and one OR gate to construct a full adder.

Answer:
(a) Truth table for the half adder:
A B | S Cout
0 0 | 0 0
0 1 | 1 0
1 0 | 1 0
1 1 | 0 1
S = A ⊕ B, Cout = AB
[Circuit: an XOR gate produces S and an AND gate produces Cout (diagram not reproduced)]

(b) Truth table for the full adder:
A B Cin | S Cout
0 0 0 | 0 0
0 0 1 | 1 0
0 1 0 | 1 0
0 1 1 | 0 1
1 0 0 | 1 0
1 0 1 | 0 1
1 1 0 | 0 1
1 1 1 | 1 1
S = (A ⊕ B) ⊕ Cin
Cout = A'B·Cin + AB'·Cin + AB·Cin' + AB·Cin = (A'B + AB')·Cin + AB·(Cin' + Cin) = (A ⊕ B)·Cin + AB
[Circuit: the first half adder sums A and B; the second half adder sums that result with Cin to produce S; an OR gate combines the two half-adder carries to produce Cout (diagram not reproduced)]
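The derived equations can be exhaustively checked against binary addition in a few lines of Python:

```python
from itertools import product

def full_adder(a, b, cin):
    s = (a ^ b) ^ cin                    # S = (A xor B) xor Cin
    cout = (a & b) | ((a ^ b) & cin)     # Cout = AB + (A xor B)Cin
    return s, cout

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # the pair (cout, s) must encode the arithmetic sum of the three input bits
    assert a + b + cin == 2 * cout + s
```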

4. [Pipelining] Refer to the following figure. If the time for an ALU operation can be shortened by 25%: (a) Will it affect the speedup obtained by pipelining? If yes, by how much? If not, why not? (b) What if the ALU now takes 25% more time?

Answer:
(a) Shortening the ALU operation will not affect the speedup obtained from
pipelining. It would not affect the clock cycle, which will be determined by
the operation that takes the most time, i.e., instruction fetch or data access.
(b) If the ALU operation takes 25% more time, it becomes the bottleneck in the
pipeline. The clock cycle needs to be 250 ps.
Suppose the original instruction time is x.
The change in speedup = (x/250) / (x/200) = 0.8
So the speedup would be 20% less.

5. [Memory] Given a computer system that features:
• a single processor
• 32-bit virtual addresses
• a cache of 2^10 sets that are four-way set-associative and have 8-byte blocks
• a main memory of 2^26 bytes
• a page size of 2^12 bytes.

[System diagram: CPU connected to the cache, the cache to main memory (which holds the page table), and main memory to disk (diagram not reproduced)]

(a) Does this system cache virtual or physical addresses?


(b) How many bytes of data from memory can the cache hold? (excluding tags)
(c) In the cache, each block of data must have a tag associated with it. How many
bits long are these tags?
(d) How many comparators are needed to build this cache while allowing single
cycle access?
(e) At any one time, what is the greatest number of page-table entries that can
have their valid bit set to 1?

Answer:
(a) Virtual
(b) 2^10 × 4 × 8 bytes = 32 Kbytes
(c) 32 - 10 - 3 = 19 bits
(d) 4 comparators
(e) 2^32 / 2^12 bytes/page = 2^20 pages.
All of them can be valid, because they can all alias the same physical page.
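The numeric answers follow directly from the given parameters, as a short sketch confirms:

```python
sets, ways, block = 2**10, 4, 8
data_bytes = sets * ways * block         # (b) data capacity, excluding tags
tag_bits = 32 - 10 - 3                   # (c) 32-bit address minus index and block offset
comparators = ways                       # (d) one tag comparator per way
page_entries = 2**32 // 2**12            # (e) number of virtual pages
print(data_bytes, tag_bits, comparators, page_entries)   # 32768 19 4 1048576
```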

95 年政大資科

1. [I/O]
(1) Both networks and buses connect components together. Which of the following
are true about them?
(a) Networks and I/O buses are almost always standardized.
(b) Shared media networks and multimaster buses need an arbitration scheme.
(c) Local area networks and processor-memory buses are almost always
synchronous.
(d) High-performance networks and buses use similar techniques compared to
their lower-performance alternatives: they are wider, send many words per
transaction, and have separate address and data lines.
(2) In ranking of the three ways of doing I/O, which statements are true?
(a) If we want the lowest latency for an I/O operation to a single I/O device, the
order is polling, DMA, and interrupt driven.
(b) In terms of lowest impact on processor utilization from a single I/O device,
the order is DMA, interrupt driven, and polling

Answer:
(1) (a), (b)
(2) (a), (b)

2. [RAID] What does RAID stand for? Regarding RAID levels 1, 3, 4, 5, and 6,
which one has the highest check disk overhead? Which one has worst throughput
for small writes?

Answer:
(1) RAID (Redundant Arrays of Inexpensive Disks): an organization that uses an array of small and inexpensive disks to increase both performance and reliability.
(2) RAID 1 has the highest check disk overhead.
(3) For small writes, RAID 3 has the worst throughput.

3. [Memory Hierarchies]
(1) Which of the following statements (if any) are generally true?
(a) There is no way to reduce compulsory misses.
(b) Fully associate caches have no conflict misses.
(c) In reducing misses, associativity is more important than capacity.
(2) A new processor can use either a write-through or write-back cache selectable
through software.
(a) Assume the processor will run data intensive applications with a large
number of load and store operations. Explain which cache write policy
should be used.

(b) Consider the same question but this time for a safety critical system in
which data integrity is more important.

Answer:
(1) (b)
(2) (a) For the data-intensive application, the cache should be write-back. A write
buffer is unlikely to keep up with this many stores, so write-through
would be too slow.
(b) For the safety-critical system, the processor should use the write-through
cache. This way the data will be written in parallel to the cache and main
memory, and we could reload bad data in the cache from memory in the
case of an error.

4. [Multiprocessors, Amdahl's law] A program takes TS seconds when executed on a single CPU. Now assume that we have p processors which can be used for
parallel processing. (a) If only a fraction f of the program can be speeded up to
take advantage of parallel processing, what is the speedup S? (b) Now assume
that the improvement in performance by using p processor can be formulated as
p(1 – px) for the parallelizable portion, find the value p that will maximize the
overall speedup.

Answer:
(a) S = 1 / ((1 - f) + f/p)
(b) Differentiate p(1 - px) with respect to p and set the derivative to 0: 1 - 2px = 0, giving p = 1/(2x). That is, when p = 1/(2x), the expression p(1 - px) attains its maximum.
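A sketch confirming both formulas (it checks the optimum p = 1/(2x) against its neighbors for a sample x):

```python
def speedup(f, p):
    # Amdahl's law: fraction f is parallelizable across p processors
    return 1 / ((1 - f) + f / p)

def parallel_gain(p, x):
    return p * (1 - p * x)   # the given performance model for the parallel portion

x = 0.01
best = 1 / (2 * x)           # claimed optimum: p = 1/(2x) = 50
assert parallel_gain(best, x) >= parallel_gain(best - 1, x)
assert parallel_gain(best, x) >= parallel_gain(best + 1, x)
print(speedup(0.9, 10), best)
```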

5. [Floating-Point Representation] Consider a shortened version of the IEEE standard floating-point format with only 12 bits: one sign bit, 5 bits for the exponent, and 6
bits for the significand.
(a) Represent 1.5 and -0.75 with this format.
(b) What is the range of numbers it could represent? (excluding denormalized
numbers.)

Answer:
(a) 1.5ten = 1.1two = 1.1two × 2^0 ⇒ floating point format = 0 01111 100000
-0.75ten = -0.11two = -1.1two × 2^-1 ⇒ floating point format = 1 01110 100000
(b) From ±1.000000 × 2^-14 to ±1.111111 × 2^15, plus ±0, ±∞, and NaN
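A small Python encoder (the helper name encode12 and its rounding scheme are illustrative assumptions, not part of the problem) reproduces both bit patterns:

```python
import math

def encode12(x, exp_bits=5, frac_bits=6, bias=15):
    # Encode a normalized value into the 12-bit sign/exponent/fraction format.
    sign = 0 if x >= 0 else 1
    m = abs(x)
    e = math.floor(math.log2(m))                    # unbiased exponent
    frac = round((m / 2.0**e - 1) * 2**frac_bits)   # fraction bits after the hidden 1
    return f"{sign:b} {e + bias:0{exp_bits}b} {frac:0{frac_bits}b}"

print(encode12(1.5))     # 0 01111 100000
print(encode12(-0.75))   # 1 01110 100000
```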

94 年政大資科

1. Use the following three operations to process 10010011, and choose the correct
result.
(1) Logical right shift: (A)10010011 (B)11100100 (C) 00100100 (D)01001001
(E) 11001001.
(2) Arithmetic right shift: (A) 11001001 (B) 01001001 (C) 10010011 (D)
11100100 (E) 00100100.
(3) Right rotate: (A) 10010011 (B) 11100100 (C) 11001001 (D) 00100100 (E)
01001001.

Answer:
(1) (D)
(2) (A)
(3) (C)
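The three answers can be reproduced with small helper functions (the names lsr8/asr8/ror8 are illustrative) on 8-bit values:

```python
def lsr8(x): return (x >> 1) & 0xFF                     # logical shift right: fill with 0
def asr8(x): return ((x >> 1) | (x & 0x80)) & 0xFF      # arithmetic: replicate the sign bit
def ror8(x): return ((x >> 1) | ((x & 1) << 7)) & 0xFF  # rotate right: LSB wraps to MSB

v = 0b10010011
print(format(lsr8(v), "08b"), format(asr8(v), "08b"), format(ror8(v), "08b"))
# 01001001 11001001 11001001
```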

2. Which of the following operations is equivalent to division by 2 in two's complement notation, when there is no overflow or underflow?
(A) arithmetic right rotate (B) arithmetic right shift (C) arithmetic left shift (D)
arithmetic left rotate (E) left rotate.

Answer: (B)
Note: strictly speaking, this question has no exact answer; (B) is the closest choice.

3. Which of the following situations cannot be a binding time? (A) When a program is written (B) When a base register used for addressing is loaded (C) When the instruction containing the address is executed (D) When the program is translated
(E) none of the above.

Answer: (A)

4. Addressing modes constitute a very important topic when people discuss


alternative designs of CPUs. Common addressing modes include register,
immediate, direct, and PC-relative mode, etc. Find a wrong statement from the
following choices. (A) Every CPU supports the direct addressing mode. (B) In
practice, having immediate, direct, register, and indexed mode is enough for
almost all applications. (C) Compilers are in charge of finding the best addressing
modes for statements written in high-level languages. (D) When offering a
limited number of addressing modes, the architecture must make sure that
common applications will be computable. (E) none of the above.

Answer: (A)

5. Assume that you are designing the instruction format for a strange new CPU that
will have 16 user-accessible registers. Further assume (1) that all instructions will
be encoded in exactly 16 bits and (2) that your boss wants you to include as many
instructions that employ register addressing as possible. If each instruction must
allow users to use at least two registers, how many different instructions can you
get? Which of the following choice is impossible?
(A) 64 (B) 128 (C) 256 (D) 512 (E) none of the above.

Answer: (D)

6. Which is not a possible effect of increasing the degree of associativity in the design of a cache? (A) decreasing the miss rate (B) increasing the hit time (C) requiring more comparators (D) avoiding the need to bind actual addresses to variable names (E) none of the above.

Answer: (D)

7. Assume a cache of 2K blocks and a 16-bit address. Let α and β, respectively, be the total number of sets and the total number of tag bits for a cache that is two-way set associative. Let γ and δ, respectively, be the total number of sets and the total number of tag bits for a cache that is fully associative. Compute α/γ and β/δ. You must show the computation process for getting your answers to get credit.

Answer:
α = 2K / 2 = 1K = 2^10
β = 2^10 × 2 × 6 = 6 × 2^11
(Note: the problem does not give enough information to compute the tag width directly; ignoring the block offset, the tag field = 16 - 10 = 6 bits.)
γ = 1
δ = 1 × 2K × 16 = 32K = 2^15
α/γ = 1K = 2^10; β/δ = (6 × 2^11) / 2^15 = 0.375
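The ratios can be checked numerically (the 6-bit tag width is the assumption stated in the note):

```python
blocks, addr_bits = 2 * 1024, 16
tag = 6                        # assumed two-way tag width, ignoring the block offset

num_sets_2way = blocks // 2            # sets in the two-way cache
tag_bits_2way = num_sets_2way * 2 * tag
num_sets_full = 1                      # fully associative: a single set
tag_bits_full = blocks * addr_bits     # every block compares the full 16-bit address

print(num_sets_2way // num_sets_full, tag_bits_2way / tag_bits_full)   # 1024 0.375
```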

8. In the following C program fragment, which types of hazards might occur in a pipelined machine? Explain your answers.
if (a == b) {
    x = y + z;
    w = x - 1;
}
r = w + x;

Answer:

Control hazards and data hazards may occur.
Suppose the compiler assigns registers $a0, $a1, $s0, $s1, $s2, and $s3 to the variables a, b, x, y, z, and w. The C fragment might then be assembled as:
1 bne $a0, $a1, L1
2 add $s0, $s1, $s2
3 addi $s3, $s0, -1
4 L1: add $s0, $s3, $s0
Instruction 1 (the bne) causes a control hazard, and if a == b holds, data hazards exist between instruction pairs (2,3), (2,4), and (3,4).

9. Which of the following architectures/models for multiprocessor systems is most unlikely to adopt the critical-section techniques commonly discussed in Operating Systems courses? Explain your answer.
(A) uniform memory access (B) symmetric multiprocessors (C) nonuniform
memory access (D) message passing.

Answer: (D)

10. Explain the main difference between the actual meanings represented by the
following paired terms.
(1) CISC vs. RISC
(2) Big endian vs. little endian
(3) Programmed I/O vs. DMA

Answer:
(1) CISC stands for complex instruction set computer and is the name given to
processors that use a large number of complicated instructions, to try to do
more work with each one.
RISC stands for reduced instruction set computer and is the generic name
given to processors that use a small number of simple instructions, to try to do
less work with each instruction but execute them much faster.
(2) Big endian: the most significant byte of any multibyte data field is stored at
the lowest memory address.
Little endian: the least significant byte of any multibyte data field is stored at
the lowest memory address.
(3) Programmed I/O: CPU has to write/read one byte at a time between main
memory and device. This takes up a lot of CPU time
DMA: DMA controller does programmed I/O on behalf of CPU so that CPU
can do other work

93 年政大資科

1. Acronyms:
Example: DMA → Direct Memory Access. (1) VHDL (2) SoC (3) TLB (4) RAID
(5) NUMA. (Within the context of computer architecture.)

Answer:
(1) VHDL: Very high speed integrated circuit Hardware Description Language
(2) SoC: System on a Chip
(3) TLB: Translation-Lookaside Buffer
(4) RAID: Redundant Array of Inexpensive Disks
(5) NUMA: NonUniform Memory Access

2. Number System, IEEE 754:


(1) Determine the base of the number system for the following operation to be
correct: 23 + 44 + 13 + 32 = 222
(2) Find the smallest normalized positive floating-point number in the IEEE-754
single-precision representation.

Answer:
(1) Suppose the base is B. Then (2B + 3) + (4B + 4) + (1B + 3) + (3B + 2) =
2B^2 + 2B + 2 ⇒ 10B + 12 = 2B^2 + 2B + 2 ⇒ B^2 − 4B − 5 = 0 ⇒ B = 5.
So the base of the number system is 5.
(Note: with the original problem's sum of 223 there is no solution; if the sum is
changed to 222, the answer is 5.)
(2) 1.0 × 2^−126 = 0 00000001 00000000000000000000000
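The bit pattern in (2) can be checked with Python's `struct` module by reinterpreting the 32-bit pattern as a single-precision float:

```python
import struct

# Bit pattern: sign 0, exponent 00000001, fraction all zeros
bits = 0b0_00000001_00000000000000000000000
value = struct.unpack('>f', struct.pack('>I', bits))[0]
print(value == 2.0 ** -126)   # smallest normalized positive single: True
```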

3. Cache:
Consider data transfers between two levels of a hierarchical memory. We
logically organize all data stored in the lower level into blocks and store the
most-frequently-used blocks in the upper level. Assume that the hit rate is H, the
upper-level latency is Tu, the lower-level latency is Tl, and the miss penalty is
Tm.
(1) What is the average memory latency?
(2) What is the speedup by using the hierarchical memory system?

Answer:
(1) Average memory latency = Tu + (1 – H) × Tm
(2) Speedup = Tl / (Tu + (1 – H) × Tm)
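A numeric sketch of the two formulas, with sample values (H = 0.95, Tu = 1, Tl = 60, Tm = 50 cycles; illustrative numbers, not from the problem):

```python
def avg_latency(H, Tu, Tm):
    # every access pays the upper-level latency; misses add the miss penalty
    return Tu + (1 - H) * Tm

def speedup(H, Tu, Tl, Tm):
    # compared with always accessing the lower level directly
    return Tl / avg_latency(H, Tu, Tm)

print(avg_latency(0.95, 1, 50))     # 3.5 cycles
print(speedup(0.95, 1, 60, 50))     # about 17.1
```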

4. Pipelining:
(1) Assuming no hazards, a pipelined processor with s stages (each stage takes 1
clock cycle) can execute n instructions in ____ clock cycles.
(2) Use the result in (1) to show that the ideal pipeline speedup equals the number
of stages.

Answer:
(1) (s – 1) + n
(2) speedup = execution time before enhancement / execution time after enhancement
= (n × s) / (s – 1 + n)
As n → ∞, speedup → (n × s) / n = s

5. Approximation Circuits:
In a traditional adder design, the calculation must consider all input bits to obtain
the final carry out. However, in real programs, inputs to the adder are not
completely random and the effective carry chain is much shorter for most cases.
Instead of designing for the worst-case scenario, it is possible to build a faster
adder with a much shorter carry chain to approximate the result. Suppose we
consider only the previous k inputs (lookahead k-bits) instead of all previous
input bits to estimate the i-th carry bit ci, i.e.,
ci = f (ai-1, bi-1, ai-2, bi-2, …, ai-k, bi-k) where 0 < k < i + 1 and aj, bj = 0 if j < 0.
(1) With random inputs, show that ci will generate a correct result with a
probability of 1 − 1/2^(k+2).
(2) What is the probability of having a correct 'carry' result considering only the
k previous inputs for an N-bit addition?
(3) Design a logic circuit to detect when the approximate adder will generate an
incorrect carry result for the i-th carry bit.

Answer:
(1) If at least one of (ai, bi) is 1, stage i can propagate the carry from stage i − 1.
The probability that exactly one of (ai, bi) is 1 (propagate) is 1/2 (the sample
space is 00, 01, 10, 11), and the probability that both are 1 (generate) is 1/4.
The estimate is wrong only when none of the k inspected stages (i − 1 to i − k)
generates a carry, stage i − k − 1 generates one, and every one of the k stages
propagates it; this happens with probability (1/2)^k × (1/4) = 1/2^(k+2).
Therefore 1 − 1/2^(k+2) is the probability that the k inspected stages yield the
correct carry (the error occurs when a carry arrives from before the k stages
and all k stages propagate it).
(2) Because there are in total (N − k − 1) such k-stage windows in the N bits, the
probability = (1 − 1/2^(k+2))^(N−k−1)
(3) An error occurs exactly when a carry arrives at stage i − k and all k inspected
stages propagate it [i.e., for every j with i − 1 ≥ j ≥ i − k, exactly one of
(aj, bj) is 1]. The error-detect function is therefore
Error = (a(i−1) ⊕ b(i−1)) · (a(i−2) ⊕ b(i−2)) · … · (a(i−k) ⊕ b(i−k)) · c(i−k)

92 年政大資科

1. Cache:
Suppose that you have a computer system with the following properties:
Instruction miss rate (IMR): 1.5%
Data miss rate (DMR): 4.0%
Percentage of memory instructions (MI): 30%
Miss penalty (MP): 60 cycles
Assume that there is no penalty for a cache hit. Also assume that a cache block is
one-word (32 bits).
(1) Express the number of CPU cycles required to execute a program with K
instructions (assuming that CPI = 1) in terms of the miss rates, miss
percentage and miss penalty.
(2) You are allowed to upgrade the computer with one of the following
approaches:
(a) Get a new processor that is twice as fast as your current computer. The
new processor's cache is twice as fast too, so it can keep up with the
processor.
(b) Get a new memory that is twice as fast.
What is a better choice? Explain with a detailed quantitative comparison
between the two choices.

Answer:
(1) The number of CPU cycles = K × (1 + 0.015 × 60 + 0.04 × 0.3 × 60) = 2.62K
(2) (a) The CPI for new processor = 0.5 + 0.015 × 60 + 0.04 × 0.3 × 60 = 2.12
(b) The CPI for new memory = 1 + 0.015 × 30 + 0.04 × 0.3 × 30 =1.81
So, (b) is a better choice.
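The comparison can be reproduced in Python; as in the answer above, both upgraded CPIs are expressed in cycles of the original clock so they can be compared directly.

```python
IMR, DMR, MI, MP = 0.015, 0.04, 0.30, 60   # miss rates, memory-instr. fraction, penalty

base_cpi = 1 + IMR * MP + DMR * MI * MP          # 2.62 cycles per instruction
fast_cpu = 0.5 + IMR * MP + DMR * MI * MP        # (a): compute part halves, stalls don't
fast_mem = 1 + IMR * MP / 2 + DMR * MI * MP / 2  # (b): miss penalty halves
print(base_cpi, fast_cpu, fast_mem)              # (b) wins: 1.81 < 2.12
```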

2. Floating-point Arithmetic :
(1) Why is biased notation used in IEEE 754 representation?
(2) Write the binary representation for the smallest negative floating point value
greater than -1 using single precision IEEE 754 format.
(3) Illustrate the key steps in floating-point multiplication using the example:
0.5ten × (−0.4375ten).

Answer:
(1) Most floating-point operations must first compare the exponents to decide
which operand's significand to align. With biased notation, the exponent
fields can be compared directly as unsigned values, without regard to sign,
so the comparison is faster.
(2) −1 = 1 01111111 00000000000000000000000. So, the smallest negative
floating-point value greater than −1 is 1 01111110 11111111111111111111111
(Note: −1 is greater than −2.)
(3) In binary, the task is 1.000two × 2^−1 times −1.110two × 2^−2
Step 1: Adding the exponents:
(−1 + 127) + (−2 + 127) − 127 = 124, i.e., an unbiased exponent of −3
Step 2: Multiplying the significands:
1.000two × 1.110two = 1.110000two, so the product is 1.110000two × 2^−3;
we need to keep it to 4 bits, so it is 1.110two × 2^−3
Step 3: The product is already normalized and, since 127 ≥ −3 ≥ −126,
there is no overflow or underflow
Step 4: Rounding the product makes no change: 1.110two × 2^−3
Step 5: Make the sign of the product negative: −1.110two × 2^−3
Converting to decimal: −1.110two × 2^−3 = −0.001110two = −7/32ten = −0.21875ten

3. Parallel Computing :
(1) What are the two possible approaches for parallel processors to share data?
(2) Outline Flynn‘s taxonomy of parallel computers.
(3) Suppose you want to perform two sums: one is a sum of two scalar variables
and one is a matrix sum of a pair of two-dimensional arrays, size 500 by 500.
What speedup do you get with 500 processors?

Answer:
(1) Single address space and Message passing
(2) 1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data streams (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data streams (MIMD)
1  500  500
(3) Speedup = 500  500 = 499
1
500
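A sketch of the Amdahl's-law computation in (3), with the scalar sum as the serial part:

```python
def amdahl_speedup(serial_ops, parallel_ops, p):
    # execution time before versus after using p processors (perfect load balance)
    before = serial_ops + parallel_ops
    after = serial_ops + parallel_ops / p
    return before / after

print(amdahl_speedup(1, 500 * 500, 500))   # approximately 499
```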

4. Disk I/O :
Suppose we have a magnetic disk with the following parameters.
Controller overhead 1 ms
Average seek time 12 ms
# sectors per track 32 sectors/track
Sector size 512 bytes
(1) If the disk‘s rotation rate is 3600 RPS. What is the transfer rate for this disk?
(2) What is the average time to read or write an entire track (16 consecutive kB).
If the disk‘s rotation rate is 3600 RPM? Assume sectors can be read or written
in any order.
(3) If we would like an average access time of 21.33 ms to read or write 8
consecutive sectors (4k bytes), what disk rotation rate is needed?
Answer:
(1) Data transfer rate = 3600 × 32 × 512 bytes/sec = 57600 KB/sec
(2) Access time = seek + rotation + controller = 12 ms + (1/60) s + 1 ms
= 12 ms + 16.67 ms + 1 ms = 29.67 ms
(3) Suppose the rotation rate is R revolutions per second.
Access time = 21.33 ms = 12 ms + (0.5/R) s + (8/(R × 32)) s + 1 ms
⇒ R = 90.036 RPS
(Note: (2) does not include rotational latency for locating a particular sector,
since an entire track is read and the read can start at any sector.)
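Part (3)'s rotation-rate requirement can be checked numerically; `disk_access_ms` is an illustrative helper implementing the average-access model above (seek + controller + half a rotation + transfer).

```python
def disk_access_ms(seek_ms, ctrl_ms, rps, sectors, sector_bytes, bytes_to_read):
    # average access time in ms at a rotation rate of rps revolutions/second
    rotational = 0.5 / rps * 1000
    transfer = bytes_to_read / (rps * sectors * sector_bytes) * 1000
    return seek_ms + ctrl_ms + rotational + transfer

# part (3): which integer rate gives about 21.33 ms for 8 sectors (4 KB)?
matches = [rps for rps in range(80, 101)
           if abs(disk_access_ms(12, 1, rps, 32, 512, 8 * 512) - 21.33) < 0.01]
print(matches)   # [90]
```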

5. Pipelining :
(1) Explain the three types of hazards encountered in pipelining. (Note: State the
causes and possible solutions.)
(2) What are the characteristics of the MIPS instruction set architecture (ISA) that
facilitates pipelined execution? (Note: State at least two properties.)

Answer:
(1) 1. Structural hazards: hardware cannot support the instructions executing in
the same clock cycle (limited resources)
2. Data hazards: attempt to use item before it is ready (Data dependency)
3. Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
Structural hazard:
Add enough hardware (for example, use two memories, one for instructions
and one for data).
Data hazard:
Software solution: (a) have the compiler insert enough no-operation (nop)
instructions; (b) reorder the instructions so that the data hazard disappears.
Hardware solution: (a) use forwarding to feed results to later instructions that
have a data hazard; (b) for a load-use hazard, which forwarding alone cannot
resolve, stall one clock cycle and then forward.
Control hazard:
Software solution: (a) have the compiler insert enough no-operation (nop)
instructions; (b) use delayed branches.
Hardware solution: (a) move the branch decision earlier in the pipeline to
reduce the number of instructions that must be flushed when a control hazard
occurs; (b) use branch prediction (static or dynamic). When the prediction is
correct, the pipeline runs at full speed and performance does not suffer from
branches; only on a misprediction must the wrongly fetched instructions be
flushed.
(2) 1. Instructions are the same length
2. Has only a few instruction formats, with the same source register field
located in the same place in each instruction
3. Memory operands only appear in loads or stores
4. Operands must be aligned in memory

96 年暨南資工

1. Please explain the following concepts or terminologies:


(a) The concept of the stored-program computer. (b) Amdahl's law. (c) Branch
delay slot. (d) Miss penalty of a cache. (e) Page table.

Answer:
(a) The idea that instructions and data of many types can be stored in memory as
numbers, leading to the stored program computer.
(b) A rule stating that the performance enhancement possible with a given
improvement is limited by the amount that the improved feature is used.
(c) The slot directly after a delayed branch instruction, which in the MIPS
architecture is filled by an instruction that does not affect the branch.
(d) The time to replace a block in cache with the corresponding block from
memory, plus the time to deliver this block to the processor.
(e) The table containing the virtual to physical address translations in a virtual
memory system.

2. For a two-bit adder that implements two‘s complement addition, answer the
following questions: (Assume no carry in)
(a) Write all possible inputs and outputs for the 2-bit adder.
(b) Indicate which inputs result in overflow.

Answer: (a), (b)

a1 a0 | b1 b0 | s1 s0 | Overflow | Remark
 0  0 |  0  0 |  0  0 |    No    | 0 + 0 = 0
 0  0 |  0  1 |  0  1 |    No    | 0 + 1 = 1
 0  0 |  1  0 |  1  0 |    No    | 0 + (−2) = −2
 0  0 |  1  1 |  1  1 |    No    | 0 + (−1) = −1
 0  1 |  0  0 |  0  1 |    No    | 1 + 0 = 1
 0  1 |  0  1 |  1  0 |    Yes   | 1 + 1 = −2
 0  1 |  1  0 |  1  1 |    No    | 1 + (−2) = −1
 0  1 |  1  1 |  0  0 |    No    | 1 + (−1) = 0
 1  0 |  0  0 |  1  0 |    No    | −2 + 0 = −2
 1  0 |  0  1 |  1  1 |    No    | −2 + 1 = −1
 1  0 |  1  0 |  0  0 |    Yes   | −2 + (−2) = 0
 1  0 |  1  1 |  0  1 |    Yes   | −2 + (−1) = 1
 1  1 |  0  0 |  1  1 |    No    | −1 + 0 = −1
 1  1 |  0  1 |  0  0 |    No    | −1 + 1 = 0
 1  1 |  1  0 |  0  1 |    Yes   | −1 + (−2) = 1
 1  1 |  1  1 |  1  0 |    No    | −1 + (−1) = −2
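The table can be regenerated, and the overflow rule checked, with a short Python sketch: overflow occurs exactly when the 2-bit two's-complement sum disagrees with the true integer sum.

```python
def to_signed2(v):
    # interpret a 2-bit pattern as a two's-complement value (-2 .. 1)
    return v - 4 if v & 0b10 else v

rows = []
for a in range(4):
    for b in range(4):
        s = (a + b) & 0b11                                   # 2-bit sum, carry-out dropped
        overflow = to_signed2(a) + to_signed2(b) != to_signed2(s)
        rows.append((a, b, s, overflow))

print(sum(r[3] for r in rows))   # 4 of the 16 input pairs overflow
```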

3. Convert the following C language into MIPS codes.


clear1(int array[ ], int size)
{
int i;
for (i = 0; i < size; i += 1)
array[i] = 0;
}

Answer:
Suppose that $a0 and $a1 contain the starting address and the size of the array,
respectively.
move $t0, $zero
loop1: sll $t1, $t0, 2
add $t2, $a0, $t1
sw $zero, 0($t2)
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1

4. Explain the five MIPS addressing modes.

Answer:
Multiple forms of addressing are generically called addressing modes. The MIPS
addressing modes are the following:
1. Register addressing, where the operand is a register.
2. Base or displacement addressing, where the operand is at the memory location
whose address is the sum of a register and a constant in the instruction.
3. Immediate addressing, where the operand is a constant within the instruction
itself.
4. PC-relative addressing, where the address is the sum of the PC and a constant
in the instruction.
5. Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC.

5. If we wish to add the new instruction jr (jump register), explain any necessary
modification to the following datapath.
[Figure: the MIPS single-cycle datapath with control, comprising the PC,
instruction memory, register file, sign-extension unit, ALU and ALU control,
data memory, and the branch adder with its PCSrc multiplexor.]

Answer:
A modification to the datapath is necessary to allow the new PC to come from a
register (Read data 1 port), and a new signal (e.g., JumpReg) to control it through
a multiplexor as shown in the following Figure.

[Figure: the datapath above with an additional multiplexor, controlled by the
new JumpReg signal, that selects the Read data 1 register value as the next PC.]

95 年暨南資工

1. The single-cycle datapath for the MIPS architecture is shown below.


[Figure: the MIPS single-cycle datapath, comprising the PC, instruction memory,
register file, sign-extension unit, ALU, data memory, and branch adder, with the
associated control signals (RegWrite, ALUSrc, MemWrite, MemRead,
MemtoReg, PCSrc).]

(a) The single-cycle datapath is not used in modern designs. Why? Please explain
in detail.
(b) What is a multicycle datapath design? Modify the above single cycle datapath
as a multicycle datapath. Draw the modified datapath and explain your
modification.
(c) What is a pipelining implementation? Modify the above single cycle datapath
as a pipelined datapath. Draw the modified datapath and explain your
modification.

Answer:
(a) The single-cycle datapath is inefficient because the clock cycle must have the
same length for every instruction in this design. The clock cycle is determined
by the longest path in the machine, even though several instruction classes
could fit in a shorter clock cycle.
(b) A multicycle datapath is an implementation in which an instruction is
executed in multiple clock cycles.

Compared to the datapath of the single-cycle design, the multicycle datapath is
modified as follows:
1. A single memory unit is used for both instruction and data.
2. There is a single ALU, rather than an ALU and two adders.
3. One or more registers are added after every major functional unit to hold
the output of that unit until the value is used in a subsequent clock cycle.
(c) In a pipelining implementation, multiple instructions are overlapped in
execution, much like an assembly line.

We separate the single-cycle datapath into five pieces, each corresponding to a
stage of instruction execution. Pipeline registers are added between two
adjacent stages to hold data so that portions of the datapath can be shared
during instruction execution.

2. Please explain the designs and advantages for RAID 0, 1, 2, 3, 4, respectively.

Answer:
RAID 0: This technique has striping but no redundancy of data. It offers the
best performance but no fault tolerance.
Advantages: 1. Best performance is achieved when data is striped across
multiple disks. 2. No parity-calculation overhead is involved, and it is easy to
implement.
RAID 1: Each disk is fully duplicated (mirroring).
Advantage: Very high availability can be achieved.
RAID 2: This type uses striping across disks, with some disks storing error
checking and correcting (ECC) information.
Advantage: Relatively simple controller design compared to RAID levels 3, 4,
and 5.
RAID 3: This type uses striping and dedicates one drive to storing parity
information.
Advantage: Very high read and write data transfer rates, since every read and
write goes to all disks.
RAID 4: RAID 4 differs from RAID 3 only in the size of the stripes sent to the
various disks.
Advantage: Better for small reads (just one disk) and small writes (fewer reads).

3. An eight-block cache can be configured as direct mapped, two-way set


associative, and fully associative. Draw these cache configurations. Explain the
relationship between cache miss rate and associativity.

Answer:
(1)
Direct mapped: 8 blocks (0–7), each holding one (tag, data) pair; a memory
block maps to exactly one cache block.
Two-way set associative: 4 sets (0–3), each holding two (tag, data) pairs; a
memory block maps to one set but may go in either way.
Fully associative: a single set of 8 (tag, data) pairs; a memory block may go in
any of the 8 entries.

(2) Increasing associativity decreases the cache miss rate, because it reduces
conflict misses.

94 年暨南資工

1. Translate the following C segment into MIPS assembly code.


while (data[i] == k)
i = i + j;
Assume: base address of array data is in $a0, size of each element in data is 4
bytes, and i, j, k correspond to $s0, $s1, $s2 respectively.
(1) Use both a conditional branch and an unconditional jump in the loop.
(2) Use only one branch or jump in the loop.

Answer:
(1) Loop: sll $t1, $s0, 2
add $t1, $t1, $a0
lw $t0, 0($t1)
bne $t0, $s2, Exit
add $s0, $s0, $s1
j Loop
Exit:
(2) Loop: sll $t1, $s0, 2
add $t1, $t1, $a0
lw $t0, 0($t1)
add $s0, $s0, $s1
beq $t0, $s2, Loop
sub $s0, $s0, $s1

2. Explain structural hazard, data hazard and branch hazard in the pipeline design.
Give an illustrative example for each of them.

Answer:
(1) 1. Structural hazards: hardware cannot support the instructions executing in
the same clock cycle (limited resources)
2. Data hazards: attempt to use item before it is ready (Data dependency)
3. Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
(2) Example program:
1 lw $5, 50($2)
2 add $2, $5, $4
3 add $4, $2, $5
4 beq $8, $9, L1
5 sub $16, $17, $18
6 sw $5, 100($2)
7 L1:
Structural hazard:
If the code above runs on a datapath with a single memory, then in clock
cycle 4 instruction 1 reads data from memory while instruction 4 fetches its
instruction from the same memory; two instructions access one memory at
the same time, so a structural hazard occurs.
Data hazard:
Instruction 1 writes its result back to register $5 only in clock cycle 5, but
instructions 2 and 3 need the contents of $5 in cycles 3 and 4, respectively.
They would read a stale value of $5, so a data hazard occurs.
Control hazard:
Instruction 4 does not know whether the branch is taken until its MEM stage
completes; if taken, instruction 7 should be fetched, otherwise instruction 5.
But while instruction 4 is in MEM, instructions 5 and 6 are already executing
in the EX and ID stages. If the branch is taken, a control hazard occurs.

3. Rewrite –100ten using a 16-bit binary representation. Use


(1) sign and magnitude representation
(2) one‘s complement representation
(3) two‘s complement representation

Answer:
(1) sign and magnitude representation: 1000000001100100
(2) one‘s complement representation: 1111111110011011
(3) two‘s complement representation: 1111111110011100
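The three 16-bit representations can be checked in Python:

```python
n = 100

sign_magnitude = format((1 << 15) | n, '016b')
ones_complement = format(0xFFFF ^ n, '016b')       # flip all 16 bits of +100
twos_complement = format((-n) & 0xFFFF, '016b')    # two's complement = ones' + 1

print(sign_magnitude)    # 1000000001100100
print(ones_complement)   # 1111111110011011
print(twos_complement)   # 1111111110011100
```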

4. We have the following statistics for two processors M1 and M2 (they have the
same classes of instructions):
Instruction class CPI Frequency Instruction class CPI Frequency
A 5 25% A 3 40%
B 2 40% B 3 35%
C 3 35% C 4 25%
M1: 200 MHz M2: 250 MHz
* CPI = clock cycles per instruction, Frequency: occurrence frequency of
the instruction class
(1) Calculate the average CPI for the two processors.
(2) Calculate the MIPS (Million Instructions Per Second) for them.
(3) Which machine is faster? How much faster?

Answer:
(1) Average CPI for M1 = 5 × 0.25 + 2 × 0.4 + 3 × 0.35 = 3.1
Average CPI for M2 = 3 × 0.4 + 3 × 0.35 + 4 × 0.25 = 3.25
(2) MIPS for M1 = (200 × 10^6) / (3.1 × 10^6) = 64.52
MIPS for M2 = (250 × 10^6) / (3.25 × 10^6) = 76.92
(3) M2 is faster than M1 by (3.1 × 5 ns) / (3.25 × 4 ns) ≈ 1.2 times
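A sketch reproducing the three computations (time per instruction is CPI times the clock period, 5 ns for M1 and 4 ns for M2):

```python
def avg_cpi(classes):
    # classes: list of (CPI, frequency) pairs
    return sum(cpi * f for cpi, f in classes)

m1_cpi = avg_cpi([(5, 0.25), (2, 0.40), (3, 0.35)])
m2_cpi = avg_cpi([(3, 0.40), (3, 0.35), (4, 0.25)])

m1_mips = 200e6 / (m1_cpi * 1e6)
m2_mips = 250e6 / (m2_cpi * 1e6)

ratio = (m1_cpi * 5) / (m2_cpi * 4)   # M1 time per instr. / M2 time per instr.
print(m1_cpi, m2_cpi, round(m1_mips, 2), round(m2_mips, 2), round(ratio, 2))
```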

5. Draw the circuit of a 1-bit ALU that performs AND, OR, addition on inputs a and
b, or a and b . Use the basic AND, OR, Inverter, and Multiplexer gates.

Answer: 1-bit ALU

[Figure: a 1-bit ALU. Inputs a and b; a Binvert multiplexor selects b or its
complement; the outputs of an AND gate, an OR gate, and a full adder (with
CarryIn and CarryOut) feed a multiplexor selected by the Operation lines,
whose output is Result.]
Control:
Binvert | CarryIn | Operation | Function
   0    |    ×    |    00     | AND
   0    |    ×    |    01     | OR
   0    |    0    |    10     | ADD
   1    |    1    |    10     | SUB

6. Assume that both the logical and physical address are 16-bits wide, the page size
is 1K (1024) bytes, and one-level paging is used
(a) How many entries are there in the page table?
(b) How many bits are occupied by each entry of the page table?
Now suppose the logical address is still 16 bits wide, but the physical address
is extended to 20 bits.
(c) In order to do the 16-bit → 20-bit mapping, some modification to the original
page table is needed. What is it?

Answer:
(a) 2^16 / 1024 = 64
(b) Length of an entry = valid bit + size of the physical page number = 1 + (20 −
10) = 11 bits
(c) Increase the length of each entry to contain the enlarged physical page number.

93 年暨南資工

1. We wish to compare the performance of two different machines: M1 and M2. The
following measurements have been made on these machines:
Program Time on M1 Time on M2
1 10 seconds 5 seconds
2 4 seconds 6 seconds

Instructions Instructions
Program
executed on M1 executed on M2
1 200 × 10^6 160 × 10^6
2 100 × 10^6 120 × 10^6
(a) Which machine is faster for each program and by how much?
(b) Find the instruction execution rate (instructions per second) for each machine
when running program 1 & 2.
(c) If the clock rate of machines M1 and M2 are 200 MHz and 500 MHz,
respectively, find the clock cycles per instruction for program 1 & 2 on both
machines using the data in Problem (a) and (b).

Answer:
(a) For Program 1, M2 is faster than M1 by 10/5 = 2 times
For Program 2, M1 is faster than M2 by 6/4 = 1.5 times
(b) Instruction execution rate = instructions executed / time:
Program 1: M1 = (200 × 10^6) / 10 = 20 × 10^6, M2 = (160 × 10^6) / 5 = 32 × 10^6
Program 2: M1 = (100 × 10^6) / 4 = 25 × 10^6, M2 = (120 × 10^6) / 6 = 20 × 10^6
(c) CPI = (time × clock rate) / instruction count:
Program 1: M1 = (10 × 200 × 10^6) / (200 × 10^6) = 10,
M2 = (5 × 500 × 10^6) / (160 × 10^6) = 15.625
Program 2: M1 = (4 × 200 × 10^6) / (100 × 10^6) = 8,
M2 = (6 × 500 × 10^6) / (120 × 10^6) = 25

2. Add 6.42ten × 10^1 to 9.51ten × 10^2, assuming that you have only three significant
decimal digits. Round to the nearest decimal number, first with guard and round
digits and then without them. Explain your work step by step.

Answer:
With guard and round digits:
6.42ten × 10^1 + 9.51ten × 10^2 = 0.642ten × 10^2 + 9.51ten × 10^2
= 10.152ten × 10^2 = 1.0152ten × 10^3, which rounds to 1.02ten × 10^3
Without guard and round digits:
6.42ten × 10^1 + 9.51ten × 10^2 = 0.64ten × 10^2 + 9.51ten × 10^2
= 10.1ten × 10^2 = 1.01ten × 10^3
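Python's `decimal` module models the with-guard-digits case: the operands are kept exact and only the final sum is rounded to the context's three significant digits, which is what guard and round digits buy in hardware.

```python
from decimal import Decimal, getcontext

getcontext().prec = 3   # keep only three significant decimal digits

total = Decimal('64.2') + Decimal('951')   # 6.42e1 + 9.51e2
print(total)   # 1.02E+3
```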

3. Assuming a 32-bit address, design


(1) a direct-mapped cache with 1024 blocks and a block size of 16 bytes (4 words).
(2) a two-way set-associative cache with 1024 blocks and a block size of 16 bytes.

Answer:
(1) Direct mapped: a 16-byte block gives a 4-bit offset (2-bit byte offset plus
2-bit block offset); 1024 blocks give a 10-bit index; the tag is therefore
32 − 10 − 4 = 18 bits. The cache is a single array of 1024 entries (indices
0–1023), each holding a valid bit, an 18-bit tag, and 128 bits of data; a hit
occurs when the indexed entry is valid and its tag matches the address tag.
(2) Two-way set associative: the 1024 blocks form 512 sets (indices 0–511), so
the index is 9 bits and the tag is 32 − 9 − 4 = 19 bits. Each set holds two
entries (valid bit, 19-bit tag, 128-bit data); both tags are compared in
parallel, and on a hit a 2-to-1 MUX selects the data of the matching way.
4. Use add rd, rs, rt (addition) and addi rd, rs, imm (addition immediate) instructions
only to show the minimal sequence of MIPS instructions for the statement a = b ×
7 – 8; Assume that a corresponds to register $s0 and b corresponds to register $s1.

Answer:
add $s0, $s1, $s1 # $s0 = 2b
add $t0, $s0, $s1 # $t0 = 3b
add $s0, $s0, $s0 # $s0 = 4b
add $s0, $s0, $t0 # $s0 = 7b
addi $s0, $s0, –8 # $s0 = 7b – 8

92 年暨南資工

1. Explain the following terminologies.


(1) stack frame
(2) nonuniform memory access
(3) write-through
(4) multiple-instruction issue
(5) out-of-order commit

Answer:
(1) When a function call is performed, information about the call is generated.
That information includes the location of the call, the arguments of the call,
and the local variables of the function being called. The information is saved
in a block of data called a stack frame.
(2) A type of single-address space multiprocessor in which some memory
accesses are faster than others depending which processor asks for which
word.
(3) A scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.
(4) A scheme whereby multiple instructions are launched in 1 clock cycle.
(5) A commit in which the results of pipelined execution are written to the
programmer-visible state in a different order than that in which the
instructions are fetched.

2. Suppose that in 1000 memory references there are 50 misses in the first-level
cache, 20 misses in the second-level cache, and 5 misses in the third-level cache,
what are the various miss rates? Assume the miss penalty from the L3 cache to
memory is 100 clock cycles, the hit time of the L3 cache is 10 clocks, the hit time
of the L2 cache is 4 clocks, the hit time of L1 is 1 clock cycle, and there are 1.2
memory references per instruction. What is the average memory access time
(average cycles per memory access) and average stall cycles per instruction?
Ignore the impact of writes.

Answer:
Global miss rate for L1 = 50 / 1000 = 0.05, for L2 = 20 / 1000 = 0.02, and for L3
= 5 / 1000 = 0.005
Average memory access time = (1 + 0.05 × 4 + 0.02 × 10 + 0.005 × 100) = 1.9
The average stall cycles per instruction = 1.2 × (1.9 – 1) = 1.08
Alternative solution:
Local miss rate for L1 = 50 / 1000 = 0.05, L2 = 20 / 50 = 0.4, L3 = 5 / 20 = 0.25
Average memory access time = 1 + 0.05 × (4 + 0.4 × (10 + 0.25 × 100)) = 1.9
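Both formulations, global miss rates and local miss rates, can be checked numerically:

```python
refs = 1000
l1_miss, l2_miss, l3_miss = 50, 20, 5          # misses per 1000 references
l1_hit, l2_hit, l3_hit, mem = 1, 4, 10, 100    # hit times and memory penalty (cycles)

# global miss rates: misses at each level per original reference
amat = (l1_hit + (l1_miss / refs) * l2_hit
        + (l2_miss / refs) * l3_hit + (l3_miss / refs) * mem)

# equivalent local-miss-rate formulation
local = l1_hit + (l1_miss / refs) * (
    l2_hit + (l2_miss / l1_miss) * (l3_hit + (l3_miss / l2_miss) * mem))

stalls_per_instr = 1.2 * (amat - 1)            # 1.2 memory references per instruction
print(amat, local, stalls_per_instr)
```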

3. Describe how to reduce the miss rate of a cache and list the classes of cache
misses that exist.

Answer:
(1) Compulsory (cold start or process migration, first reference): first access to a
block
Solution: increase block size
(2) Conflict (collision): Multiple memory locations mapped to the same cache
location. Occur in set associative or direct mapped cache
Solution: increase associativity
(3) Capacity: Cache cannot contain all blocks accessed by the program
Solution: increase cache size

4. Draw a configuration showing a processor, four 16k × 8-bit ROMs, and a bus
containing 16 address lines and 8 data lines. Add a chip-select logic block that
will select one of the four ROM modules for each of the 64k addresses

Answer:

[Figure: the CPU's 16 address lines and 8 data lines form the bus. The two
high-order address lines A15–A14 drive a 2-to-4 decoder whose four outputs
connect to the chip-select (CS) inputs of the four 16K × 8 ROMs; the remaining
address lines A13–A0 and the data lines D7–D0 are connected to every ROM in
parallel. Each decoder output thus selects one ROM for one of the four 16K
regions of the 64K address space.]

96 年台師大資工

1. Consider a cache with 2K blocks and a block size of 16 bytes. Suppose the
address is 32 bits.
(a) Suppose the cache is direct-mapped. Find the number of sets in the cache.
Compute the number of tag bits per cache block.
(b) Repeat part (a) when the cache becomes a 2-way set associative cache.
(c) Repeat part (a) when the cache becomes a fully associative cache.

Answer:
(a) The number of sets = the number of blocks in the cache = 2K
The length of the index field = 11, and the length of the offset field = 4
⇒ Tag size = 32 − 11 − 4 = 17 bits per block
(b) The number of sets = 2K / 2 = 1K
The length of the index field = 10, and the length of the offset field = 4
⇒ Tag size = 32 − 10 − 4 = 18 bits per block
(c) The number of sets = 1
Tag size = 32 − 4 = 28 bits per block
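A small helper (illustrative, assuming power-of-two sizes) reproduces the three tag widths:

```python
import math

def tag_bits(addr_bits, blocks, block_bytes, assoc):
    # per-block tag width = address bits - index bits - offset bits
    offset = int(math.log2(block_bytes))
    index = int(math.log2(blocks // assoc))
    return addr_bits - index - offset

print(tag_bits(32, 2048, 16, 1))      # direct mapped: 17
print(tag_bits(32, 2048, 16, 2))      # two-way: 18
print(tag_bits(32, 2048, 16, 2048))   # fully associative: 28
```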

2. Suppose we have two implementations of the same instruction set architecture.


Computer A has a clock cycle time of 500 ps, and computer B has a clock rate of
2.5 GHz. Consider a program having 1000 instructions.
(a) Suppose computer A has a clock cycles per instruction (CPI) 2.3 for the
program. Find the CPU time (in ns) for the computer A.
(b) Suppose the CPU time of the computer B is 800 ns for the same program.
Compute the CPI of computer B for the program.

Answer:
(a) CPU time for computer A = 1000 × 2.3 × 500 ps = 1150 ns
(b) 800 ns = 1000 × CPIB × 0.4 ns ⇒ CPIB = 2

3. Assume a MIPS processor executes a program having 800 instructions. The


frequency of loads and stores in the program is 25%. Moreover, an instruction
cache miss rate for the program is 1%, and a data cache miss rate is 4%. The miss
penalty is 100 cycles for all misses.
(a) Find the total number of instruction miss cycles.
(b) Find the total number of data miss cycles.

Answer:
(a) The total instruction miss cycles = 800 × 1% × 100 = 800
(b) The total data miss cycles = 800 × 25% × 4% × 100 = 800

4. Consider a 5-stage (IF, ID, EX, MEM, WB) MIPS pipeline processor with hazard
detection unit. Suppose the processor has instruction memory for IF stage, and
data memory for MEM stage so that the structural hazard for memory references
can be avoided.
(a) Assume no forwarding unit is employed for the pipeline. We are given a code
sequence shown below.
LD R1, 10(R2); R1  MEM[R2 + 10]
SUB R4, R1, R6; R4  R1 − R6
ADD R5, R1, R6; R5  R1 + R6
Show the timing of each instruction of the code sequence. Your answer may
be in the following form.
Clock Cycle
Instruction
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6
ADD R5, R1, R6
(b) Repeat part (a) when a forwarding unit is used.
(c) Consider another code sequence shown below.
SUB R1, R3, R8; R1  R3 − R8
SUB R4, R1, R6; R4  R1 − R6
ADD R5, R1, R6; R5  R1 + R6
Suppose both hazard detector and forwarding unit are employed. Show the
timing of each instruction of the code sequence.

Answer:
(a) Suppose that the register read and write can happen in the same clock cycle.
Clock Cycle
Instruction
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6 IF ID ID ID EX MEM WB
ADD R5, R1, R6 IF IF IF ID EX MEM WB
(b)
Clock Cycle
Instruction
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6 IF ID ID EX MEM WB
ADD R5, R1, R6 IF IF ID EX MEM WB
(c)
Clock Cycle
Instruction
1 2 3 4 5 6 7 8 9 10
SUB R1, R3, R8 IF ID EX MEM WB
SUB R4, R1, R6 IF ID EX MEM WB
ADD R5, R1, R6 IF ID EX MEM WB

95 年台師大資工

1. (1) What is the decimal value of the following 32-bit two's complement number?
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0

(2) What is the decimal value of the following IEEE 754 single-precision binary
representation?
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Answer:
(1) 11111111111111111111101101111100two
= −(00000000000000000000010010000100two) = −(2^10 + 2^7 + 2^2) = −1156ten
(2) −1.01two × 2^(132 − 127) = −1.01two × 2^5 = −101000two = −40ten

2. Consider a five-stage (IF, ID, EX, MEM and WB) pipeline processor with hazard
detection and data forwarding units. Assume the processor has instruction
memory for IF stage and data memory for MEM stage so that the structural
hazard for memory references can be avoided.
(1) Suppose the following code sequence is executed on the processor. Determine
the average CPI (clock cycles per instruction) of the code sequence.
ADD R1, R2, R3; R1  R2+R3
SUB R4, R5, R6; R4  R5 − R6
(2) Repeat Part (1) for the following code sequence.
ADD R1, R2, R3; R1  R2+R3
SUB R4, R1, R6; R4  R1 − R6
(3) Repeat Part (1) for the following code sequence.
LD R3, 10(R7) R3  MEM[R7 + 10]
ADD R1, R2, R3; R1  R2 + R3
SUB R4, R1, R6; R4  R1 − R6

Answer:
(1) Clock cycles = (5 – 1) + 2 = 6
CPI = clock cycles / instruction count = 6 / 2 = 3
(2) Although there is a data hazard between instruction ADD and SUB, it can be
resolved by forwarding unit and no pipeline stall is needed.
Clock cycles = (5 – 1) + 2 = 6
CPI = clock cycles / instruction count = 6 / 2 = 3
(3) The data hazard between ADD and SUB can be resolved by the forwarding
unit, but the load-use hazard between LD and ADD requires one stall cycle.
Hence, clock cycles = (5 – 1) + 3 + 1 = 8
CPI = clock cycles / instruction count = 8 / 3 = 2.67

3. Consider a five-stage (IF, ID, EX, MEM and WB) pipeline processor with
instruction memory for IF stage and data memory for MEM stage. Suppose the
following code sequence is executed on the processor.
LD R2, 100(R1); R2  MEM[R1 + 100]
LD R4, 200(R3); R4  MEM[R3 + 200]
ADD R6, R2, R4; R6  R2 + R4
SUB R8, R2, R4; R8  R2 – R4
SD R6, 120(R1); MEM[R1 + 120]  R6
SD R8, 120(R3); MEM[R3 + 120]  R8
(1) Determine the total number of memory references.
(2) Determine the percentage of the memory references which are data
references.

Answer:
(1) Every instruction must itself be fetched from memory, and there are 4
memory-reference (load/store) instructions in the code sequence. Hence the
number of memory references = 6 + 4 = 10
(2) The percentage of the memory references = 4 / 10 = 40%

4. (1) Consider a direct mapped cache with 64 blocks and a block size of 32 bytes.
What block number does byte address 1600 map to?
(2) Repeat Part (1) for the byte address 3209.
(3) With a 32-bit virtual address, 8-KB pages, and 4 bytes per page table entry,
determine the total page table size (in MB).

Answer:
(1) Memory block address = ⌊1600 / 32⌋ = 50; 50 mod 64 = 50
Byte address 1600 maps to cache block number 50.
(2) Memory block address = ⌊3209 / 32⌋ = 100; 100 mod 64 = 36
Byte address 3209 maps to cache block number 36.
(3) Number of virtual pages = 2^32 / 8K = 2^19 ⇒ the page table has 2^19 entries.
Page table size = 2^19 × 4 bytes = 2 MB
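Both mappings and the page-table size can be checked with a few lines of Python (constants taken from the problem; the snippet is illustrative, not part of the exam answer):

```python
BLOCK_BYTES, NUM_BLOCKS = 32, 64

def cache_block(byte_addr):
    # direct-mapped: memory block number modulo the number of cache blocks
    return (byte_addr // BLOCK_BYTES) % NUM_BLOCKS

assert cache_block(1600) == 50
assert cache_block(3209) == 36

# Part (3): one 4-byte page-table entry per virtual page
entries = 2**32 // (8 * 1024)        # 2^19 virtual pages
assert entries * 4 == 2 * 2**20      # 2 MB page table
```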

94 年台師大資工

1. Consider the 5-stage pipeline shown below.

[Figure: the standard five-stage pipelined datapath — PC and instruction memory (IF); register file and sign extension (ID); ALU and branch-target adder (EX); data memory (MEM); write-back mux (WB) — separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
(a) Use the following load instruction as an example.


LD R5, 128(R1); # R5 ← M[128 + R1]
Briefly explain the major operations of the pipeline at each stage.
(b) Consider the following code sequence.
LD R5, 128(R1); # R5 ← M[128 + R1]
ADD R3, R2, R5; # R3 ← R2 + R5
Will the execution of the ADD instruction cause a data hazard? Justify your
answer. If your answer is YES, determine whether the data hazard can be
removed by a forwarding technique.
(c) Consider the following code sequence.
ADD R5, R6, R7; # R5 ← R6 + R7
BNZ R5, exit; # goto exit if R5 ≠ 0
LD R2, 64(R1); # R2 ← M[64 + R1]
exit: ADD R8, R7, R8; # R8 ← R7 + R8
Will the execution of the BNZ instruction cause a data hazard? Justify your
answer. If your answer is YES, determine whether the data hazard can be
removed by a forwarding technique.

Answer:
(a) IF stage: the instruction "LD R5, 128(R1)" is fetched from instruction memory.
ID stage: the instruction is decoded and registers R5 and R1 are read from the
register file.
EX stage: the memory address is calculated as sign-extend(128) + [R1].
MEM stage: data is read from memory using the address calculated in the EX
stage.
WB stage: the data read from memory in the MEM stage is written into R5.
(b) YES. The load-use data hazard cannot be removed completely by the forwarding
technique. We must stall one clock cycle between these two instructions and
then use forwarding to resolve the hazard.
(c) YES. According to the datapath, the branch decision for BNZ is made in the ID
stage, so the forwarding technique cannot resolve the hazard completely and we
still have to stall for one clock cycle.

2. Consider a cache having 8K blocks. There are two words in each block. Each
word contains 4 bytes. Suppose the main memory is byte-addressed with a 32-bit
address bus.
(a) Suppose the cache is a four-way set associative cache. Find the total number
of sets and total number of tag bits.
(b) Suppose the cache is a fully associative cache. Find the total number of sets
and total number of tag bits.

Answer:
(a) Total number of sets = 8K / 4 = 2K = 2^11
The tag field has 32 – 3 – 11 = 18 bits
The total number of tag bits = 2K × 4 × 18 = 144 Kbits
(b) Total number of sets = 1
The tag field has 32 – 3 = 29 bits
The total number of tag bits = 1 × 8K × 29 = 232 Kbits
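The same bookkeeping can be captured in a small Python helper (a sketch, not part of the answer key) that reproduces both results:

```python
def tag_storage(num_blocks, assoc, block_bytes, addr_bits):
    # sets = blocks / associativity; offset and index each take log2 bits
    sets = num_blocks // assoc
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length() if sets > 1 else 0
    tag_bits = addr_bits - offset_bits - index_bits
    return sets, tag_bits, num_blocks * tag_bits   # total tag storage

# (a) 4-way set associative: 2K sets, 18-bit tags, 144 Kbits in total
assert tag_storage(8 * 1024, 4, 8, 32) == (2 * 1024, 18, 144 * 1024)
# (b) fully associative: one set, 29-bit tags, 232 Kbits in total
assert tag_storage(8 * 1024, 8 * 1024, 8, 32) == (1, 29, 232 * 1024)
```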

3. Briefly describe the LRU scheme for block replacement in a cache. Why the LRU
scheme may not be well suited for a fully associative cache? Justify your answer.

Answer:
(a) The block replaced is the one that has been unused for the longest time.
(b) Because tracking the usage information is costly.

4. Consider all the RAID systems (except the RAID 2).


(a) Which RAID system has no redundancy to tolerate disk failure?
(b) Which RAID system allows the recovery from the second failure?

Answer:
(a) RAID 0
(b) RAID 6 (P + Q)

5. What is the dynamic branch prediction? Briefly describe how a branch prediction
buffer can be used for the dynamic branch prediction.

Answer:
(a) Prediction of branches at runtime using runtime information
(b) A branch prediction buffer is a small memory indexed by the lower portion of
the address of the branch instruction. The memory contains a bit that says
whether the branch was recently taken or not.

93 年台師大資工

1. Consider a direct-mapped cache with 32K bytes of data and one-word (4-byte)
block. Suppose the main memory is byte-addressed with a 32-bit address bus.
(a) Determine the number of blocks in the cache.
(b) How many bits are required in the tag field associated with each cache block?
(c) Determine the total cache size (in bits).

Answer:
(a) 32K / 4 = 8K = 2^13 blocks
(b) The tag field has 32 – 13 – 2 = 17 bits
(c) The total cache size = 2^13 × (1 + 17 + 32) = 400K bits

2. (a) What is a translation look-aside buffer (TLB)?


(b) Does a TLB miss imply a page fault? Explain your answer.

Answer:
(a) A cache that keeps track of recently used address mappings to avoid an access
to the page table (i.e., a cache for the page table).
(b) No. The mappings stored in the TLB are only a subset of the page table. A TLB
miss with a page table hit causes no page fault; a page fault occurs only when the
access also misses in the page table.

3. Consider a processor with a five-stage pipeline as shown below:


Stage 1 IF Instruction fetch
Stage 2 ID Instruction decode and register file read
Stage 3 EX Execution or address calculation
Stage 4 MEM Data memory access
Stage 5 WB Write back
(a) Identify all the hazards in the following code.
Loop: ADD R2, R3, R4; R2 ← R3 + R4
ADD R5, R2, R6; R5 ← R2 + R6
SD R5, 100(R0); M[R0 + 100] ← R5
ADD R0, R0, -1; R0 ← R0 - 1
BNZ R0, Loop; If R0 ≠ 0, goto Loop
(b) Which hazards found in part (a) can be resolved via forwarding?

Answer:
(a) lines (1, 2) for R2, lines (2, 3) for R5, lines (4, 5) for R0
(b) Data hazards for (1, 2) and (2, 3) can be resolved via forwarding.
If the branch decision is made in the MEM stage, then (4, 5) can be resolved via
forwarding. If the branch decision is made in the ID stage, then (4, 5) cannot be
resolved via forwarding.

4. The snooping protocols are the most popular protocols for maintaining cache
coherence in a multiprocessor system. The snooping protocols are of two types:
write-invalidate and write-update.
(a) Briefly describe each type of the snooping protocol.
(b) Which type has less demand on bus bandwidth? Explain your answer.

Answer:
(1) Write-invalidate: The writing processor causes all copies in other caches to be
invalidated before changing its local copy; the writing processor issues an
invalidation signal over the bus, and all caches check to see if they have a
copy; if so, they must invalidate the block containing the word.
Write-update: Rather than invalidate every block that is shared, the writing
processor broadcasts the new data over the bus; all copies are then updated
with the new value. This scheme, also called write-broadcast, continuously
broadcasts writes to shared data, while write-invalidate deletes all other
copies so that there is only one local copy for subsequent writes.
(2) Write-invalidate has less demand on bus bandwidth. Write-update must
broadcast every write to shared data over the bus, while write-invalidate uses the
bus only to invalidate on the first write to a block, so write-update requires much
more bus bandwidth.

92 年台師大資工

1. (a) Describe the IEEE 754 floating-point standard


(b) Show the IEEE 754 binary number representation of the decimal numbers
-0.1875 in single precision.

Answer:
(a) IEEE Standard 754 floating point is the most common representation today
for real numbers on computers. The characteristics of the IEEE 754 are
described as follows:
1. The sign bit is 0 for positive, 1 for negative.
2. The exponent's base is two.
3. The exponent field contains 127 plus the true exponent for
single-precision, or 1023 plus the true exponent for double precision.
4. The first bit of the mantissa is typically assumed to be 1.f, where f is the
field of fraction bits.
(b) –0.1875 (decimal) = –0.0011₂ = –1.1₂ × 2^(–3)
The single precision format is: 1 01111100 10000000000000000000000
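The bit pattern can be confirmed with Python's struct module, which exposes the IEEE 754 single-precision encoding (a verification aid, not part of the answer):

```python
import struct

def float32_bits(x):
    # reinterpret the 4-byte float encoding as a 32-bit big-endian integer
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")

bits = float32_bits(-0.1875)
assert bits[0] == "1"                 # sign: negative
assert bits[1:9] == "01111100"        # exponent: 127 - 3 = 124
assert bits[9:] == "1" + "0" * 22     # fraction: .1000...0
```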

2. (a) Briefly describe three major types of pipeline hazards.


(b) What is the branch prediction technique? Which type of the hazard may be
solved by the branch prediction technique?
(c) What is the data forwarding technique? Which type of the hazard may be
solved by the data forwarding technique?

Answer:
(a) Structural hazards: hardware cannot support the instructions executing in the
same clock cycle (limited resources)
Data hazards: attempt to use item before it is ready. (Data dependency:
instruction depends on result of prior instruction still in the pipeline)
Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
(b) The processor tries to predict whether the branch instruction will jump or not.
Branch prediction may resolve control hazard
(c) A method of resolving a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from
programmer-visible registers or memory.
Data hazard can be solved by the data forwarding technique.

3. (a) Briefly describe the direct-mapped cache structure.
(b) Briefly describe the fully associative cache structure.
(c) Suppose, we consider only the direct-mapped and fully associative cache
structures. Which structure has higher hardware cost for block searching?
Which structure usually has higher cache miss rate? Explain your answer

Answer:
(a) A cache structure in which each memory location is mapped to exactly one
location in the cache.
(b) A cache structure in which a block can be placed in any location in the cache.
(c) Fully associative cache has higher hardware cost for block searching because
it needs more comparators for the parallel comparison. Besides, more cache bits
are needed to store the tags.
Direct-mapped cache has higher cache miss rate because conflicts among
memory locations are high.

4. (1) Briefly describe the basic concept of the direct memory access (DMA). What
advantages may the DMA have as compared with the polling and
interrupt-driven data transfer techniques?
(2) Briefly describe the three steps in a DMA transfer.

Answer:
(1) DMA: a mechanism that provides a device controller the ability to transfer
data directly to or from the memory without involving the processor.
Other than polling and interrupt transfer techniques which both consume CPU
cycles, during data transfer DMA is independent of the processor and without
consuming all the processor cycles
(2)
Step 1: The processor sets up the DMA by supplying the identity of the
device, the operation to perform on the device, the memory address
that is the source or destination of the data to be transferred, and the
number of bytes to transfer.
Step 2: The DMA starts the operation on the device and arbitrates for the
bus.
Step 3: Once the DMA transfer is complete, the controller interrupts the
processor.

93 年彰師大資工

1. Suppose a computer's address size is k bits (using byte addressing), the cache size
is S bytes, the block size is B bytes, and the cache is A-way set-associative.
Assume that B is a power of two, so B = 2b. Figure out what the following
quantities are in terms of S, B, A, b, and k: the number of sets in the cache, the
number of index bits in the address, and the number of bits needed to implement
the cache. Derive the quantities step by step clearly and explain the reason for
each step.

Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2b bytes/block
Associativity: A blocks/set
Number of sets in the cache = S / (A × B)
Number of index bits = log₂(S / (A × B)) = log₂(S / A) − b
Number of tag bits = k − [log₂(S / A) − b] − b = k − log₂(S / A)
Number of bits needed to implement the cache
= sets/cache × associativity × (data + tag + valid)
= [S / (A × B)] × A × [8B + k − log₂(S / A) + 1]
= (S / B) × [8B + k − log₂(S / A) + 1] bits

2. Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17,


19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assume that a 2-way set-associative cache is with
four-word blocks and a total size is 32 words. The cache is initially empty and
adopts LRU replacement policy. Label each reference in the list as a hit or a miss
and show the final contents of the cache.

Answer:
Address (decimal)  Address (binary)  Tag  Set index  Hit/Miss  Block0  Block1
1 000001 0 0 Miss 0 0,1,2,3
4 000100 0 1 Miss 1 4,5,6,7
8 001000 0 2 Miss 2 8,9,10,11
5 000101 0 1 Hit 1 4,5,6,7
20 010100 1 1 Miss 1 4,5,6,7 20,21,22,23
17 010001 1 0 Miss 0 0,1,2,3 16,17,18,19

19 010011 1 0 Hit 0 0,1,2,3 16,17,18,19
56 111000 3 2 Miss 2 8,9,10,11 56,57,58,59
9 001001 0 2 Hit 2 8,9,10,11 56,57,58,59
11 001011 0 2 Hit 2 8,9,10,11 56,57,58,59
4 000100 0 1 Hit 1 4,5,6,7 20,21,22,23
43 101011 2 2 Miss 2 8,9,10,11 40,41,42,43
5 000101 0 1 Hit 1 4,5,6,7 20,21,22,23
6 000110 0 1 Hit 1 4,5,6,7 20,21,22,23
9 001001 0 2 Hit 2 8,9,10,11 40,41,42,43
17 010001 1 0 Hit 0 0,1,2,3 16,17,18,19

Set Block0 Block1


0 0,1,2,3 16,17,18,19
1 4,5,6,7 20,21,22,23
2 8,9,10,11 40,41,42,43
3
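The hit/miss labeling can be replayed with a short LRU simulation in Python (configuration values come from the problem; the code itself is not part of the answer):

```python
def lru_hits(word_addrs, total_words=32, block_words=4, ways=2):
    num_sets = total_words // (block_words * ways)   # 4 sets here
    sets = [[] for _ in range(num_sets)]             # tags, LRU first
    hits = 0
    for addr in word_addrs:
        block = addr // block_words
        index, tag = block % num_sets, block // num_sets
        if tag in sets[index]:
            hits += 1
            sets[index].remove(tag)      # refresh: move to MRU position
        elif len(sets[index]) == ways:
            sets[index].pop(0)           # evict the least recently used
        sets[index].append(tag)
    return hits

refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
assert lru_hits(refs) == 9               # 9 hits, 7 misses
```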

3. A superscalar MIPS machine is implemented as follows. Two instructions are


issued per clock cycle. One of the instructions could be an integer ALU operation
or branch, and the other could be a load or store. Given the following loop, please
unroll the loop twice first and then schedule the codes to maximize performance.
Indicate which instruction(s) will be executed in each clock cycle. Assume that
the loop index is a multiple of three.
Loop: lw $t0, 0($s1) // $t0 = array element
addu $t0, $t0, $s2 // add scalar in $s2
sw $t0, 0($s1) // store result
addi $s1, $s1, -4 // decrement pointer
bne $s1, $zero, Loop //branch $s1 != 0

Answer:
(1) lw $t0, 12($s1)
addu $t0, $t0, $s2
sw $t0, 12($s1)
lw $t1, 8($s1)
addu $t1, $t1, $s2
sw $t1, 8($s1)
lw $t2, 4($s1)
addu $t2, $t2, $s2
sw $t2, 4($s1)
addi $s1, $s1, -12
bne $s1, $zero, Loop
(2)

ALU or branch instruction     Data transfer instruction     Clock cycle
Loop: addi $s1, $s1, -12 lw $t0, 0($s1) 1
lw $t1, 8($s1) 2
addu $t0, $t0, $s2 lw $t2, 4($s1) 3
addu $t1, $t1, $s2 sw $t0, 12($s1) 4
addu $t2, $t2, $s2 sw $t1, 8($s1) 5
bne $s1, $zero, Loop sw $t2, 4($s1) 6

4. Consider a pipelined MIPS machine with the following five stages:


IF: fetch instruction form memory
ID: read registers while decoding the instruction
EXE: execute the operation or calculate an address
MEM: access an operand in data memory
WB: write the result into a register
Given the following codes, identify all of the data dependencies and explain
which hazards can be resolved via forwarding.
lw $s0, 12($s1) // load data into $s0
add $s4, $s0, $s2 // $s4 = $s0 + $s2
addi $s2, $s0, 4 // $s2 = $s0 + 4
sw $s4, 12($s1) // store $s4 to memory
add $s2, $s3, $s1 // $s2 = $s3 + $s1

Answer:
lines (1, 2) for $s0: cannot be resolved completely by forwarding; one stall cycle
is needed.
lines (1, 3) for $s0: can be resolved by forwarding.
lines (2, 4) for $s4: can be resolved by forwarding.

95 年東華資工

1. Assume that a processor is a load-store RISC CPU, running with 600 MHz. The
instruction mix and clock cycles for a program as follows:
Instruction type Frequency Clock cycles
A 25% 2
B 10% 2
C 15% 3
D 30% 4
E 20% 1
(a) Find the CPI.
(b) Find the MIPS.

Answer:
(a) CPI = 0.25 × 2 + 0.1 × 2 + 0.15 × 3 + 0.3 × 4 + 0.2 × 1 = 2.55
(b) MIPS = (600 × 10^6) / (2.55 × 10^6) = 235.29
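As a quick check (not part of the answer), the weighted sum and the MIPS rating can be computed directly:

```python
# frequency and cycle count for each instruction type A..E
mix = [(0.25, 2), (0.10, 2), (0.15, 3), (0.30, 4), (0.20, 1)]
cpi = sum(freq * cycles for freq, cycles in mix)
mips = 600e6 / (cpi * 1e6)        # MIPS = clock rate / (CPI x 10^6)
assert round(cpi, 2) == 2.55
assert round(mips, 2) == 235.29
```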

2. We make an enhancement to a computer that improves some mode of execution


by a factor of 10. Enhanced mode is used 80% of the time, measured as a
percentage of the execution time when the enhanced mode is in use.
(a) What is the speedup we have obtained from fast mode?
(b) What percentage of the original execution time has been converted to fast
mode.
Hint: The Amdahl's Law depends on the fraction of the original, unenhanced
execution time that could make use of enhanced mode. Thus, we cannot directly
use this 80% measurement to compute speedup with Amdahl‘s Law.

Answer:
(a) Speedup = Time_unenhanced / Time_enhanced
The unenhanced time is the sum of the time that does not benefit from the 10
times faster speedup, plus the time that does benefit, but before its reduction
by the factor of 10. Thus,
Time_unenhanced = 0.2 × Time_enhanced + 10 × 0.8 × Time_enhanced = 8.2 × Time_enhanced
Substituting into the equation for speedup gives us:
Speedup = Time_unenhanced / Time_enhanced = 8.2
(b) Using Amdahl's Law, the given value of 10 for the enhancement factor, and
the value for Speedup from Part (a), we have:
8.2 = 1 / [(1 − f) + (f / 10)] ⇒ f = 0.9756
Solving shows that the enhancement can be applied 97.56% of the original
time.

3. The following code fragment processes two arrays and produces an important
value in register $v0. Assume that each array consists of 1000 words indexed 0
through 999, that the base addresses of the arrays are stored in $a0 and $a1
respectively, and their sizes (1000) are stored in $a2 and $a3, respectively.
Assume that the code is run on a machine with a 1 GHz clock. The required
number of cycles for instruction add, addi and sll are all 1 and for instructions lw
and bne are 2. In the worst case, how many seconds will it take to execute this
code?
sll $a2, $a2, 2
sll $a3, $a3, 2
add $v0, $zero, $zero
add $t0, $zero, $zero
outer: add $t4, $a0, $t0
lw $t4, 0($t4)
add $t1, $zero, $zero
inner: add $t3, $a1, $t1
lw $t3, 0($t3)
bne $t3, $t4, skip
addi $v0, $v0, 1
skip: addi $t1, $t1, 4
bne $t1, $a3, inner
addi $t0, $t0, 4
bne $t0, $a2, outer

Answer:
1. Before outer loop there are 4 instructions require 4 cycles.
2. Outer loop has 3 instructions before the inner loop and 2 after. The cycles
needed to execute 1 + 2 + 1 + 1 + 2 = 7 cycles per iteration, or 1000 × 7
cycles.
3. The inner loop requires 1 + 2 + 2 + 1 + 1 + 2 = 9 cycles per iteration and it
repeats 1000 × 1000 times, for a total of 9 × 1000 × 1000 cycles.
The total number of cycles executed is therefore 4 + (1000 × 7) + (9 × 1000 ×
1000) = 9007004. The overall execution time is therefore 9007004 / (1 × 10^9)
≈ 9 ms.
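The cycle accounting can be reproduced with the per-section counts from the analysis (a check, not part of the answer):

```python
N = 1000
setup = 4                              # four 1-cycle instructions before the loops
outer_overhead = 1 + 2 + 1 + 1 + 2     # add, lw, add before; addi, bne after
inner_body = 1 + 2 + 2 + 1 + 1 + 2     # add, lw, bne, addi, addi, bne
total_cycles = setup + N * outer_overhead + N * N * inner_body
assert total_cycles == 9_007_004
assert round(total_cycles / 1e9 * 1000) == 9   # about 9 ms at 1 GHz
```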

4. Draw the gates for the Sum bit of an adder for the following equation (a′ means
NOT a).
Sum = (a′ · b′ · CarryIn) + (a′ · b · CarryIn′) + (a · b′ · CarryIn′) + (a · b · CarryIn)

Answer:
[Figure: four 3-input AND gates — one per product term, with inverters producing a′, b′, and CarryIn′ where needed — feed a 4-input OR gate whose output is Sum.]

5. (a) Please explain the difference between the "write-through" policy and the
"write-back" policy?
(b) Assume that the instruction cache miss rate is 4% and the data cache miss rate
is 5%. If a processor has a CPI of 2.0 without any memory stalls and the miss
penalty is 200 cycles for all misses, determine how much faster a processor
would run with a perfect cache that never missed? Here, the frequency of loads
and stores is 35%.

Answer:
(a) Write-through: The information is written to both the block in the cache and to
the block in the lower level of the memory hierarchy.
Write-back: The information is written only to the block in the cache. The
modified block is written to the lower level of the hierarchy only when it is
replaced.
(b) The CPI including memory stalls is 2 + 0.04 × 200 + 0.35 × 0.05 × 200 = 13.5
The processor with a perfect cache runs 13.5 / 2 = 6.75 times faster than the
processor without one.
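The stall arithmetic, checked in Python (values from the problem; not part of the answer):

```python
base_cpi, penalty = 2.0, 200
inst_miss, data_miss, mem_frac = 0.04, 0.05, 0.35
# every instruction fetch can miss; only loads/stores (35%) access data memory
stall_cpi = base_cpi + inst_miss * penalty + mem_frac * data_miss * penalty
assert round(stall_cpi, 2) == 13.5
assert round(stall_cpi / base_cpi, 2) == 6.75    # perfect-cache speedup
```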

6. Please explain the following terms: (a) compulsory misses, (b) capacity misses,
and (c) conflict misses.

Answer:
(a) A cache miss caused by the first access to a block that has never been in the
cache.
(b) A cache miss that occurs because the cache, even with full associativity, cannot
contain all the blocks needed to satisfy the request.
(c) A cache miss that occurs in a set-associative or direct-mapped cache when
multiple blocks compete for the same set.

96 年台科大電子

1. Given the number 0x811F00FE, what is it interpreted as:


(a) Four two's complement bytes?
(b) Four unsigned bytes?

Answer:
0x811F00FE = 1000 0001 0001 1111 0000 0000 1111 1110₂
(a) −127, 31, 0, −2
(b) 129, 31, 0, 254
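The byte-by-byte interpretation can be checked in Python (not part of the answer):

```python
word = 0x811F00FE
raw = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
signed = [b - 256 if b >= 128 else b for b in raw]   # two's complement
assert raw == [129, 31, 0, 254]        # (b) four unsigned bytes
assert signed == [-127, 31, 0, -2]     # (a) four two's complement bytes
```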

2. Given the following instruction mix, what is the CPI for this processor?
Operation Frequency CPI
A 50% 1
B 15% 4
C 15% 3
D 10% 4
E 5% 1
F 5% 2

Answer:
CPI = 1 × 0.5 + 4 × 0.15 + 3 × 0.15 + 4 × 0.1 + 1 × 0.05 + 2 × 0.05 = 2.1

3. The following piece of code has pipeline hazard(s) in it. Please try to reorder the
instructions and insert the minimum number of NOP to make it hazard-free.
(Note: Assume all the necessary forwarding logics exist)
haz: move $5, $0
lw $10, 1000($20)
addiu $20, $20, -4
addu $5, $5, $10
bne $20, $0, haz

Answer:
haz: lw $10, 1000($20)
addiu $20, $20, -4
move $5, $0
bne $20, $0, haz
addu $5, $5, $10

Note: following the Patterson & Hennessy textbook, unless the problem states
otherwise, branch instructions are assumed to be optimized so that the branch
decision is made in the ID stage.

4. Given a MIPS machine with 2-way set-associative cache that has 2-word blocks
and a total size of 32 words. Assume that the cache is initially empty, and that it
uses an LRU replacement policy. Given the following memory accesses in
sequence:
0ff00f70
0ff00f60
0fe0012c
0ff00f5c
0fe0012c
0fe001e8
0f000f64
0f000144
0fe00204
0ff00f74
0f000f64
0f000128
(a) Please label whether they will be hits or misses.
(b) Please calculate the hit rate.

Answer:
(a)
Byte address (hex)   Tag (hex part + binary part)   Index   Offset   Hit/Miss
0ff00f70 0ff00f 01 110 000 Miss
0ff00f60 0ff00f 01 100 000 Miss
0fe0012c 0fe001 00 101 100 Miss
0ff00f5c 0ff00f 01 011 100 Miss
0fe0012c 0fe001 00 101 100 Hit
0fe001e8 0fe001 11 101 000 Miss
0f000f64 0f000f 01 100 100 Miss
0f000144 0f0001 01 000 100 Miss
0fe00204 0fe002 00 000 100 Miss
0ff00f74 0ff00f 01 110 100 Hit
0f000f64 0f000f 01 100 100 Hit
0f000128 0f0001 00 101 000 Miss

(b) Hit rate = 3/12 = 0.25 = 25%

5. The speed of the memory system affects the designer's decision on the size of the
cache block. Which of the following cache designer guidelines are generally valid?
why?
(a) The shorter the memory latency, the smaller the cache block.
(b) The shorter the memory latency, the larger the block.
(c) The higher the memory bandwidth, the smaller the cache block.
(d) The higher the memory bandwidth, the larger the cache block.

Answer: (a) and (d)


A lower miss penalty can lead to smaller blocks, yet higher memory bandwidth
usually leads to larger blocks, since the miss penalty is only slightly larger.

6. Please state whether the following techniques are associated primarily with a
software- or hardware-based approach to exploiting ILP. In some cases, the
answer may be both.
(a) Branch prediction
(b) Dynamic scheduling
(c) Out-of-order execution
(d) EPIC
(e) Speculation
(f) Multiple issue
(g) Superscalar
(h) Reorder buffer
(i) Register renaming
(j) Predication

Answer:
(a) (b) (c) (d) (e) (f) (g) (h) (i) (j)
B H H B B B H H B S
H: hardware, S: software, B: both

7. What is Saturating Arithmetic? What kind instructions use this feature?

Answer:
(1) Saturation arithmetic is used in graphics routines. As an example, assume you
add together two medium-red pixels. Saturating arithmetic ensures the result
is a dark red or black. It's certainly different than regular integer math, where
you could perform the above operation and end up with a light-colored result.
(2) Intel MMX supports both signed and unsigned saturating arithmetic.
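A minimal sketch of the idea in Python, using 8-bit unsigned pixel values (the pixel example mirrors the answer; the helper names are ours):

```python
def saturating_add_u8(a, b):
    # clamp at the largest 8-bit value instead of wrapping around
    return min(a + b, 255)

def wrapping_add_u8(a, b):
    # ordinary modular addition, as in regular integer math
    return (a + b) & 0xFF

# adding two medium-bright pixel values (200 and 100)
assert saturating_add_u8(200, 100) == 255   # clamps to the maximum
assert wrapping_add_u8(200, 100) == 44      # wraps to a small value
```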

8. Please describe the Shift-and-Add multiplier architecture and its control steps.

Answer:
The shift-and-add multiplier uses a 64-bit Product register whose right half is
initialized with the 32-bit multiplier, a 32-bit multiplicand register, and a 32-bit
ALU. The control steps (flowchart in the original) are:
1. Test Product0, the least significant bit of the Product register.
2. If Product0 = 1, add the multiplicand to the left half of the Product register and
place the result back in the left half; if Product0 = 0, skip the addition.
3. Shift the Product register right 1 bit.
4. If fewer than 32 repetitions have been done, go back to step 1; after the 32nd
repetition, the Product register holds the 64-bit product and the operation is done.
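The control steps map directly onto a few lines of Python (a behavioral sketch, not hardware):

```python
def shift_and_add(multiplicand, multiplier, bits=32):
    # Product register: multiplier in the right half, zeros in the left half
    product = multiplier & ((1 << bits) - 1)
    for _ in range(bits):
        if product & 1:                        # step 1-2: test Product0, add
            product += multiplicand << bits    # add into the left half
        product >>= 1                          # step 3: shift right 1 bit
    return product                             # 2*bits-wide product

assert shift_and_add(7, 9) == 63
assert shift_and_add(123_456, 654_321) == 123_456 * 654_321
```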

9. What fields are contained in TLB (translation lookaside buffer)? What are the
purposes of these fields?

Answer:
(a) valid: indicates that the page to be accessed is in physical memory.
(b) dirty: indicates whether this page should be written back.
(c) reference: helps decide which page should be replaced.
(d) tag: identifies whether the associated mapping is in the TLB.
(e) physical page number: indicates which physical page the virtual page is
mapped to.

10. How many tag-comparators are needed in a 2-way set associative cache controller?
Why?

Answer:
2 comparators.
Each set contains two blocks, and both blocks in the selected set must be
compared against the tag in parallel.

11. What is Cache Line Width? Why is it larger than the word-size of CPU?

Answer:
(a) Cache line width: the cache block size, i.e., the number of bytes in a cache block.
(b) To exploit more spatial locality.

12. Use Verilog or VHDL languages to design a one-bit 8-to-1 multiplexer circuit.

Answer:
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY mux8_1 IS

PORT
(sel :IN STD_LOGIC_VECTOR(2 downto 0);
d0, d1, d2, d3, d4, d5, d6, d7 :IN STD_LOGIC;
z :OUT STD_LOGIC);

END mux8_1;

ARCHITECTURE behavior OF mux8_1 IS


BEGIN

WITH sel SELECT


z <= d0 when "000",
d1 when "001",
d2 when "010",
d3 when "011",
d4 when "100",
d5 when "101",
d6 when "110",
d7 when "111",
'0' when others;

END behavior;

95 年海洋資工

1. Convert these RTL descriptions for a multi-cycle MIPS CPU datapath into a
control specification and FSM state diagram.
Step 1 — Instruction fetch (all instructions):
IR ← Memory[PC]; PC ← PC + 4
Step 2 — Instruction decode/register fetch (all instructions):
A ← Reg[IR[25-21]]; B ← Reg[IR[20-16]];
ALUOut ← PC + (sign-extend(IR[15-0]) << 2)
Step 3 — Execution, address computation, branch/jump completion:
R-type: ALUOut ← A op B
Memory reference: ALUOut ← A + sign-extend(IR[15-0])
Branch: if (A == B) then PC ← ALUOut
Jump: PC ← {PC[31-28], (IR[25-0] << 2)}
Step 4 — Memory access or R-type completion:
Load: MDR ← Memory[ALUOut]
Store: Memory[ALUOut] ← B
R-type: Reg[IR[15-11]] ← ALUOut
Step 5 — Memory read completion:
Load: Reg[IR[20-16]] ← MDR

Answer:
[FSM state diagram (drawn in the original): a fetch state and a decode state shared
by all instructions, then separate paths — memory-address computation followed by
memory read and load write-back, or by memory write; R-type execution followed by
R-type completion; branch completion; jump completion — each path returning to the
fetch state.]
2. A multi-cycle CPU has 3 implementations. The first one is a 5-cycle
IF-ID-EX-MEM-WB design running at 4.8 GHz, where load takes 5 cycles,
store/R-type 4 cycles and branch/jump 3 cycles. The second one is a 6-cycle
design running 5.6 GHz, with MEM replaced by MEM1 & MEM2. The third is a
7-cycle design running at 6.4 GHz, with IF further replaced by IF1 & IF2.
Assume we have an instruction mix: load 26%, store 10%, R-type 49%,
branch/jump 15%. Do you think it is worthwhile to go for the 6-cycle design over
the 5-cycle design? How about the 7-cycle design, is it worthwhile? Please give
your rationales.

Answer:
The average CPI for implementation 1 is:
5 × 0.26 + 4 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.11
The execution time for an instruction in implementation 1 = 4.11 / 4.8G ≈ 0.86 ns
The average CPI for implementation 2 is:
6 × 0.26 + 5 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.47
The execution time for an instruction in implementation 2 = 4.47 / 5.6G ≈ 0.80 ns
The average CPI for implementation 3 is:
7 × 0.26 + 6 × 0.1 + 5 × 0.49 + 4 × 0.15 = 5.47
The execution time for an instruction in implementation 3 = 5.47 / 6.4G ≈ 0.85 ns
It is worthwhile to go for the 6-cycle design over the 5-cycle design, but it is not
worthwhile to go for 7-cycle design over the 6-cycle design.
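A Python check of the three designs, with the per-class cycle counts derived above (a verification aid, not part of the answer):

```python
mix = {"load": 0.26, "store": 0.10, "rtype": 0.49, "branch": 0.15}
designs = [
    (4.8e9, {"load": 5, "store": 4, "rtype": 4, "branch": 3}),  # 5-cycle
    (5.6e9, {"load": 6, "store": 5, "rtype": 4, "branch": 3}),  # 6-cycle
    (6.4e9, {"load": 7, "store": 6, "rtype": 5, "branch": 4}),  # 7-cycle
]
times_ns = []
for clock_hz, cycles in designs:
    cpi = sum(mix[k] * cycles[k] for k in mix)
    times_ns.append(round(cpi / clock_hz * 1e9, 2))
assert times_ns == [0.86, 0.80, 0.85]
# the 6-cycle design beats the 5-cycle one, but 7-cycle does not beat 6-cycle
assert times_ns[1] < times_ns[0] and times_ns[2] > times_ns[1]
```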

3. We have a program core consisting of five conditional branches. The program


core will be executed thousands of times. Below are the outcomes of each branch
for one execution of the program core (T for taken, N for not taken).
Branch 1: T-T-T
Branch 2: N-N-N-N
Branch 3: T-N-T-N-T-N
Branch 4: T-T-T-N-T
Branch 5: T-T-N-T-T-N-T
Assume the behavior of each branch remains the same for each program core
execution. For dynamic schemes, assume each branch has its own prediction
buffer and each buffer initialized to the same state before each execution. List the
predictions for the following branch prediction schemes:
a. Always taken
b. Always not taken
c. 1-bit predictor, initialized to predict taken
d. 2-bit predictor, initialized to weakly predict taken
What are the prediction accuracies?

Answer:

Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-T-T-T, right: 0, wrong: 4
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
(a) Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Branch 5: prediction: T-T-T-T-T-T-T, right: 5, wrong: 2
Total: right: 15, wrong: 10, Accuracy = 100% × 15/25 = 60%
Branch 1: prediction: N-N-N, right: 0, wrong: 3
Branch 2: prediction: N-N-N-N, right: 4, wrong: 0
Branch 3: prediction: N-N-N-N-N-N, right: 3, wrong: 3
(b) Branch 4: prediction: N-N-N-N-N, right: 1, wrong: 4
Branch 5: prediction: N-N-N-N-N-N-N, right: 2, wrong: 5
Total: right: 10, wrong: 15, Accuracy = 100% × 10/25 = 40%
Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-N-N-N, right: 3, wrong: 1
Branch 3: prediction: T-T-N-T-N-T, right: 1, wrong: 5
(c) Branch 4: prediction: T-T-T-T-N, right: 3, wrong: 2
Branch 5: prediction: T-T-T-N-T-T-N, right: 3, wrong: 4
Total: right: 13, wrong: 12, Accuracy = 100% × 13/25 = 52%
Branch 1: prediction: T-T-T, right: 3, wrong: 0
Branch 2: prediction: T-N-N-N, right: 3, wrong: 1
Branch 3: prediction: T-T-T-T-T-T, right: 3, wrong: 3
(d) Branch 4: prediction: T-T-T-T-T, right: 4, wrong: 1
Branch 5: prediction: T-T-T-T-T-T-T, right: 5, wrong: 2
Total: right: 18, wrong: 7, Accuracy = 100% × 18/25 = 72%
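All four schemes can be simulated in a few lines (the outcome strings come from the problem; predictor conventions as in the answer — this code is not part of it):

```python
outcomes = ["TTT", "NNNN", "TNTNTN", "TTTNT", "TTNTTNT"]

def score(predict, update, init):
    right = total = 0
    for seq in outcomes:
        state = init                     # each branch has its own buffer
        for actual in seq:
            right += predict(state) == actual
            total += 1
            state = update(state, actual)
    return right, total

always_taken = score(lambda s: "T", lambda s, a: s, None)
always_not = score(lambda s: "N", lambda s, a: s, None)
one_bit = score(lambda s: s, lambda s, a: a, "T")   # remember last outcome
two_bit = score(lambda c: "T" if c >= 2 else "N",   # saturating counter 0..3
                lambda c, a: min(3, c + 1) if a == "T" else max(0, c - 1),
                2)                                   # 2 = weakly taken
assert always_taken == (15, 25)   # 60%
assert always_not == (10, 25)     # 40%
assert one_bit == (13, 25)        # 52%
assert two_bit == (18, 25)        # 72%
```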

95 年元智資工

1. (a) Show the IEEE 754 binary representation for the floating-point number 0.1
(decimal) in single precision.
(b) Add 2.56 × 10^2 to 2.34 × 10^4, assuming that we have only 3 significant
decimal digits (no guard and round digits are used).
(c) What's the number of ulp (units in the last place) in (b)?
(d) Assume that you have only 4 significant decimal digits. Round the number
12.4650 to nearest even.
Answer:
(a) 0.1 (decimal) = 0.000110011…₂ = 1.10011001…₂ × 2^(−4)
Sign = 0, Significand = .10011…
Exponent = −4 + 127 = 123
0 01111011 10011001100110011001100
(b) 2.56 × 10^2 + 2.34 × 10^4 = 0.02 × 10^4 + 2.34 × 10^4 = 2.36 × 10^4
(c) 2
(d) 12.46

2. Suppose that in 1000 memory references there are 60 misses in the first-level
cache, 30 misses in the second-level cache, and 5 misses in the third-level cache.
Assume the miss penalty from the L3 cache to memory is 100 clock cycles, the
hit time of the L3 cache is 10 clocks, the hit time of the L2 cache is 5 clocks, the
hit time of L1 is 1 clock cycle, and there are 1.5 memory references per
instruction.
(a) What‘s the global miss rate for each level of caches?
(b) What‘s the local miss rate for each level of caches?
(c) What is the average memory access time?
(d) What is the average stall cycle per instruction?

Answer:
(a) L1 = 60/1000 = 0.06, L2 = 30/1000 = 0.03, L3 = 5/1000 = 0.005
(b) L1 = 60/1000 = 0.06, L2 = 30/60 = 0.5, L3 = 5/30 = 0.167
(c) AMAT = 1 + 0.06 × 5 + 0.03 × 10 + 0.005 × 100 = 2.1 clock cycles
(d) (2.1 − 1) × 1.5 = 1.65 clock cycles per instruction
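The AMAT computation, replayed in Python (numbers from the problem; not part of the answer):

```python
refs = 1000
l1_miss, l2_miss, l3_miss = 60 / refs, 30 / refs, 5 / refs   # global rates
l1_hit, l2_hit, l3_hit, mem_penalty = 1, 5, 10, 100
amat = l1_hit + l1_miss * l2_hit + l2_miss * l3_hit + l3_miss * mem_penalty
assert round(amat, 2) == 2.1
stalls_per_instruction = (amat - l1_hit) * 1.5   # 1.5 references/instruction
assert round(stalls_per_instruction, 2) == 1.65
```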

3. (a) Consider a virtual memory system with the following properties: 38-bit virtual
byte address, 8 KB pages, 36-bit physical byte address. What is the total size of
the page table for each process on this processor, assuming that the memory
management bits take a total of 8 bits and that all the virtual pages are in use?
(Assume each entry in the page table should be round up to full bytes.)
(b) Briefly describe at least 3 techniques to minimize the memory dedicated to
page tables.

Answer:
(a) Number of page table entries = 2^38 / 2^13 = 2^25
Bits per entry = 8 + (36 − 13) = 31, rounded up to full bytes ⇒ 4 bytes per entry
The size of the page table = 2^25 × 4 bytes = 128 Mbytes.
(b)
1. To keep a limit register that restricts the size of the page table for a given
process. If the virtual page number becomes larger than the contents of the
limit register, entries must be added to the page table
2. Maintain two separate page tables and two separate limits. The high-order
bit of an address usually determines which segment and thus which page
table to use for that address
3. Apply a hashing function to the virtual address so that the page table data
structure need be only the size of the number of physical pages in main
memory. Such a structure is called an inverted page table
4. Multiple levels of page tables: First level maps large fixed-size blocks of
virtual address space by segment table; Each entry in the page table points to
a page table for that segment
5. Page tables to be paged: allow the page tables to reside in the virtual address
space

4. (a) Suppose a pipelined processor has S stages. If the processor takes 110 ns to
execute N instructions and 310 ns to execute 3N instructions, what are S and N,
respectively? (Assume that the clock rate is 500 MHz and no pipeline stalls
occur.)
(b) For a pipelined implementation, assume that one-quarter of the load
instructions are immediately followed by an instruction that uses the result, that
the branch delay on misprediction is 1 clock cycle, and that half of the
branches are mispredicted. Assume that jumps always pay 1 full clock cycle of
delay, so their average time is 2 clock cycles. If the instruction mix is 25%
loads, 10% stores, 52% ALU instructions, 11% branches, and 2% jumps,
please calculate the average CPI.

Answer:
(a) At 500 MHz each clock cycle takes 2 ns, and an S-stage pipeline with no stalls
finishes N instructions in (S – 1) + N cycles:
(S – 1) + N = 110/2 = 55
(S – 1) + 3N = 310/2 = 155
Subtracting the first equation from the second gives 2N = 100, so N = 50 and S = 6.
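The two equations can be verified numerically (assuming, as stated, a 500 MHz clock and no stalls):

```python
# Pipelined time with no stalls: ((S - 1) + N) cycles to finish N instructions.
cycle_ns = 1e9 / 500e6        # 500 MHz -> 2 ns per cycle
c1 = 110 / cycle_ns           # (S - 1) + N  = 55
c2 = 310 / cycle_ns           # (S - 1) + 3N = 155
N = int((c2 - c1) / 2)        # subtracting gives 2N = 100
S = int(c1 - N + 1)
print(N, S)  # 50 6
```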
(b) CPI = 1 + (0.25 × 0.25 × 1 + 0.11 × 0.5 × 1 + 0.02 × 1) = 1.1375
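The CPI computation can be reproduced directly from the instruction mix:

```python
# Average CPI = base 1.0 plus the stall cycles contributed by each class.
mix = {"load": 0.25, "store": 0.10, "alu": 0.52, "branch": 0.11, "jump": 0.02}
cpi = 1.0
cpi += mix["load"] * 0.25 * 1    # 1/4 of loads followed by a dependent use
cpi += mix["branch"] * 0.5 * 1   # half of branches mispredicted, 1-cycle delay
cpi += mix["jump"] * 1           # jumps always pay 1 extra cycle
print(round(cpi, 4))  # 1.1375
```

Stores and ALU instructions contribute no stalls, which is why they drop out of the sum.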

5. (a) Suppose we have a benchmark that executes in 100 seconds of elapsed time,
where 90 seconds is CPU time and the rest is I/O time. If CPU time improves
by 50% per year for the next five years but I/O time doesn't improve, how
much faster will our program run at the end of five years?
(b) Consider program P, which runs on a 1 GHz machine M in 10 seconds. An
optimization is made to P, replacing all instances of multiplying a value by 4
(mult X, X, 4) with two instructions that set X to X + X twice (add X, X; add X,
X). Call this new optimized program P'. The CPI of a multiply instruction is 4,
and the CPI of an add is 1. After recompiling, the program now runs in 9
seconds on machine M. How many multiplies were replaced by the new
compiler?

Answer:
(a)
After n years   CPU time               I/O time     Elapsed time
0               90 seconds             10 seconds   100 seconds
1               90/1.5 = 60 seconds    10 seconds   70 seconds
2               60/1.5 = 40 seconds    10 seconds   50 seconds
3               40/1.5 = 27 seconds    10 seconds   37 seconds
4               27/1.5 = 18 seconds    10 seconds   28 seconds
5               18/1.5 = 12 seconds    10 seconds   22 seconds
The improvement in elapsed time is 100/22 ≈ 4.5
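The year-by-year figures can be reproduced with a short loop (rounding CPU time to whole seconds each year, as the worked answer does):

```python
# Five years of 50%-per-year CPU improvement; I/O time stays at 10 seconds.
cpu, io = 90, 10
for year in range(1, 6):
    cpu = round(cpu / 1.5)   # round to whole seconds each year
    print(year, cpu, io, cpu + io)
speedup = 100 / (cpu + io)
print(round(speedup, 1))     # 100/22 -> 4.5
```

Note that because I/O time is fixed, the overall speedup (≈4.5×) is far below the 1.5^5 ≈ 7.6× improvement in CPU time alone — an instance of Amdahl's law.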
(b) At 1 GHz the machine executes 10^9 cycles per second, so the optimization saves
10^9 × 10 − 10^9 × 9 = 10^9 cycles. Replacing a mult with two adds saves
4 − 2 × 1 = 2 cycles per replacement. Thus there were 10^9 / 2 = 5 × 10^8 replacements.
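The same count can be computed directly:

```python
# Cycles saved by the recompilation, at 1 GHz (10^9 cycles per second).
clock_hz = 10**9
cycles_saved = clock_hz * 10 - clock_hz * 9      # 10^9 fewer cycles
saved_per_replacement = 4 - 2 * 1                # one mult (4) -> two adds (2)
replacements = cycles_saved // saved_per_replacement
print(replacements)  # 500000000, i.e. 5 * 10^8
```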

95 年中原資工

1. Please define the following terms:


a. Finite State Machine
b. Microprogramming
c. Pipeline Hazards
d. Branch Prediction
e. Superscalar
f. Dynamic Multiple Issues Execution (Out-of-order Execution)

Answer:
a. A sequential logic function consisting of a set of inputs and outputs, a next
state function that maps the current state and the inputs to a new state, and an
output function that maps the current state and possibly the inputs to a set of
asserted outputs.
b. A method of specifying control that uses microcode rather than a finite state
representation.
c. The situations in pipeline when the next instruction cannot execute in the
following clock cycle.
d. A method of resolving a branch hazard that assumes a given outcome for the
branch and proceeds from that assumption rather than waiting to ascertain the
actual outcome.
e. An advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle.
f. A situation in pipelined execution when an instruction blocked from executing
does not cause the following instructions to wait.
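Definition (a) — a set of states, a next-state function, and an output function — can be made concrete with a small example; the parity-tracking machine below is a hypothetical illustration, not part of the original answer:

```python
# A finite state machine per definition (a): a next-state function mapping
# (state, input) -> state, and an output function mapping state -> output.
# Hypothetical example: assert the output when the bits seen so far
# contain an even number of 1s.
NEXT = {("even", 0): "even", ("even", 1): "odd",
        ("odd", 0): "odd", ("odd", 1): "even"}
OUTPUT = {"even": 1, "odd": 0}

def run(bits, state="even"):
    for b in bits:
        state = NEXT[(state, b)]
    return OUTPUT[state]

print(run([1, 0, 1]), run([1, 1, 1]))  # 1 0
```

Because the output depends only on the current state, this is a Moore machine; letting OUTPUT depend on the input as well would make it a Mealy machine.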

2. Identify all of the data dependencies in the following code. Which dependencies
are data hazards that will be resolved via forwarding? Which dependencies are
data hazards that will cause a stall?
add $3, $4, $2
sub $5, $3, $1
lw $6, 200($3)
add $7, $3, $6

Answer:
Data dependencies: (line 1, line 2), (1, 3), (1, 4), (3, 4)
Data hazards: (1, 2), (1, 3), (3, 4)
Resolved via forwarding: (1, 2), (1, 3)
Causes a stall: (3, 4)
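The classification above can be checked mechanically, assuming the standard 5-stage MIPS pipeline with full forwarding, where only a load followed immediately by a dependent instruction must stall:

```python
# Find data dependences in a short MIPS sequence and flag the load-use
# stall. Each instruction is (opcode, destination, source registers).
prog = [("add", "$3", ["$4", "$2"]),
        ("sub", "$5", ["$3", "$1"]),
        ("lw",  "$6", ["$3"]),
        ("add", "$7", ["$3", "$6"])]

deps, stalls = [], []
for j, (_, _, srcs) in enumerate(prog):
    for i in range(j):
        op_i, dest_i, _ = prog[i]
        if dest_i in srcs:
            deps.append((i + 1, j + 1))       # 1-based, as in the answer
            # A load result is not available until MEM, so an immediately
            # following use cannot be fully forwarded: one stall cycle.
            if op_i == "lw" and j == i + 1:
                stalls.append((i + 1, j + 1))

print(deps)    # [(1, 2), (1, 3), (1, 4), (3, 4)]
print(stalls)  # [(3, 4)]
```

Dependence (1, 4) is listed but is not a hazard: by the time line 4 reads $3 in its ID stage, line 1 has already written it back.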

3. Please design a complete datapath of a "Pipelined Processor" with (a) Forwarding
Unit (b) Hazard Detection Unit (c) Stall (d) Exception/Interrupt (e) Branch
Prediction, for the following eight instructions: "add", "sub", "and", "or", "lw",
"sw", "beq", "j". Then explain how it works.

Answer:

(a) Forwarding Unit: resolve a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from programmer-visible
registers or memory.
(b) Hazard Detection Unit: stall and deassert the control fields if the load-use hazard
test is true.
(c) Stall: preserve the PC register and the IF/ID pipeline register from changing.
(d) Exception/Interrupt: a Cause register records the cause of the exception, and
an EPC saves the address of the instruction that caused the exception; the
instructions that follow the offending instruction are flushed.
(e) Branch Prediction: assume that the branch will not be taken; if the branch is
taken, the instructions that are being fetched must be discarded.

