Assignment 2

1.
Consider the following high-level code segment:

    for (i=600; i>0; i--) arr[i] = arr[i] + x;

For a standard MIPS pipeline with FP unit latencies as specified in the textbook (page 75, 4th edition), complete the following tasks:
a) Write the above code segment in MIPS assembly language. Mention your assumptions in the code comments.
Ans: The straightforward MIPS code, not scheduled for the pipeline, looks like this:

    Loop: L.D    F0, 0(R1)      ; F0=array element
          ADD.D  F4, F0, F2     ; add scalar in F2
          S.D    F4, 0(R1)      ; store result
          DADDUI R1, R1, #-8    ; decrement pointer by 8 bytes (per DW)
          BNE    R1, R2, Loop   ; branch R1!=R2

In the above code segment, R1 is initially the address of the element in the array with the highest address and F2 contains the scalar value x. Register R2 is precomputed, so that 8(R2) is the address of the last element to operate on.
b) Show the necessary stalls/idle clock cycles for part (a).
Ans: First, consider the following figure:
Figure 1: Latencies of FP operations.
Then, using Figure 1, the necessary stalls/idle clock cycles for part (a) are as follows:

                                  Clock cycle issued
    Loop: L.D    F0, 0(R1)        1
          stall                   2
          ADD.D  F4, F0, F2       3
          stall                   4
          stall                   5
          S.D    F4, 0(R1)        6
          DADDUI R1, R1, #-8      7
          stall                   8
          BNE    R1, R2, Loop     9

The loop incurs 4 stalls and takes 9 clock cycles per iteration.
c) Schedule the code produced in (b).
Ans: We can schedule the code produced in (b) as follows:

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall
          stall
          S.D    F4, 8(R1)
          BNE    R1, R2, Loop

The stalls after ADD.D are for use by the S.D.
d) How many clock cycles are saved using the optimal scheduling?
Ans: 2 clock cycles are saved using the optimal scheduling. Only two stalls remain, which reduces the time to 7 cycles per iteration. The actual work of operating on the array element takes just 3 of those 7 clock cycles (the load, add, and store). The remaining 4 clock cycles consist of loop overhead (the DADDUI and BNE) and two stalls.
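The stall counts in (b) and (c) can be cross-checked with a small in-order issue model. The following is a minimal sketch, not the textbook's method: the `LATENCY` table is an assumption chosen to match the stalls shown in the answers above (Figure 1 itself is not reproduced here), and the names `issue_cycles`, `UNSCHEDULED`, and `SCHEDULED` are hypothetical.

```python
# Stall cycles required between a producer and a dependent consumer.
# ASSUMPTION: these values mirror the stalls shown in the answers above.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,
}

def issue_cycles(program):
    """Return the issue cycle of each (kind, dest, sources) instruction."""
    produced = {}  # register -> (producer kind, producer issue cycle)
    cycles = []
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1  # in-order issue, one instruction per cycle
        for reg in sources:
            if reg in produced:
                producer_kind, producer_cycle = produced[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, producer_cycle + gap + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            produced[dest] = (kind, cycle)
    return cycles

# Loop body from (a): L.D, ADD.D, S.D, DADDUI, BNE
UNSCHEDULED = [
    ("load", "F0", ["R1"]),
    ("fp_alu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("int_alu", "R1", ["R1"]),
    ("branch", None, ["R1", "R2"]),
]

# Loop body from (c): L.D, DADDUI, ADD.D, S.D, BNE
SCHEDULED = [
    ("load", "F0", ["R1"]),
    ("int_alu", "R1", ["R1"]),
    ("fp_alu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("branch", None, ["R1", "R2"]),
]

unscheduled_cycles = issue_cycles(UNSCHEDULED)
scheduled_cycles = issue_cycles(SCHEDULED)
print("unscheduled issue cycles:", unscheduled_cycles)
print("scheduled issue cycles:  ", scheduled_cycles)
print("cycles saved:", unscheduled_cycles[-1] - scheduled_cycles[-1])
```

The issue cycle of the final BNE gives the per-iteration time, so the difference between the two runs corresponds to the saving reported in (d). The model ignores structural hazards and any branch delay, which the answers above also leave out.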
e) We have the number of iterations as a multiple of 3. Show the loop unrolled 3 times (3 copies of the loop body). Eliminate any obviously redundant computations and do not reuse any registers.
Ans: We unroll the loop so that there are three copies of the loop body, assuming R1 - R2 (that is, the size of the array) is initially a multiple of 24, which means that the number of loop iterations is a multiple of 3. As the question requires, we eliminate any redundant computations and do not reuse any of the registers. Here is the result after merging the DADDUI instructions and dropping the unnecessary BNE operations that are duplicated during unrolling. Note that R2 must now be set so that 24(R2) is the starting address of the last three elements.

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)        ; drop DADDUI & BNE
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)       ; drop DADDUI & BNE
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-24
          BNE    R1, R2, Loop

Without scheduling, each L.D incurs 1 stall, each ADD.D 2, and the DADDUI 1. Adding the 11 instruction issue cycles gives a total of 21 clock cycles, or 7 clock cycles for each of the three elements.
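The tally for the unscheduled unrolled loop can be written out explicitly. This is only an arithmetic sketch of the count described above; the variable names are hypothetical, and the per-instruction stall values are the ones stated in the text.

```python
COPIES = 3  # loop bodies after unrolling

# Stalls following each kind of instruction, as stated in the text above.
stalls_per_load = 1     # L.D -> ADD.D
stalls_per_add = 2      # ADD.D -> S.D
stalls_after_daddui = 1  # DADDUI -> BNE

total_stalls = COPIES * stalls_per_load + COPIES * stalls_per_add + stalls_after_daddui

# Three instructions per copy (L.D, ADD.D, S.D) plus one DADDUI and one BNE.
instructions = COPIES * 3 + 2

total_cycles = total_stalls + instructions
cycles_per_element = total_cycles / COPIES

print(total_stalls, instructions, total_cycles, cycles_per_element)
```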
f) Schedule the unrolled loop from (e).
Ans: The unrolled loop from (e) can be scheduled as follows:

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-24
          S.D    F12, 8(R1)
          BNE    R1, R2, Loop

g) What is the execution time of the unrolled loop after scheduling?
Ans: The code in (f) has no stalls. The execution time of the unrolled loop therefore drops to a total of 11 clock cycles, or about 3.67 clock cycles per element, compared with 9 cycles per element before any unrolling or scheduling and 7 cycles when scheduled but not unrolled.
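To put the per-iteration figures side by side, the sketch below scales each variant to the 600 elements of the original loop. It is an illustration only: it assumes steady-state behaviour and ignores pipeline fill and drain, and the names `VARIANTS` and `total_cycles` are hypothetical.

```python
ELEMENTS = 600  # trip count of the original for loop

# variant -> (clock cycles per loop iteration, elements handled per iteration)
VARIANTS = {
    "original": (9, 1),
    "scheduled": (7, 1),
    "unrolled": (21, 3),
    "unrolled + scheduled": (11, 3),
}

def total_cycles(cycles_per_iteration, elements_per_iteration, elements=ELEMENTS):
    """Steady-state cycle count, ignoring pipeline fill and drain."""
    iterations = elements // elements_per_iteration
    return cycles_per_iteration * iterations

totals = {name: total_cycles(*shape) for name, shape in VARIANTS.items()}
speedup = totals["original"] / totals["unrolled + scheduled"]

for name, cycles in totals.items():
    per_element = cycles / ELEMENTS
    print(f"{name:22s} {cycles:5d} cycles  {per_element:.2f} per element")
print(f"speedup of unrolled + scheduled over original: {speedup:.2f}")
```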
2.
    L.D    F2, 32(R1)
    L.D    F6, 44(R2)
    MUL.D  F8, F6, F4
    SUB.D  F0, F6, F2
    DIV.D  F10, F8, F2
    ADD.D  F2, F0, F6

The above code should be executed out of order in the MIPS FP unit using Tomasulo's Algorithm. Show all the stages of the Status Table (Instruction status, Reservation stations, Register status) in detail until the last execution has completed but not yet written its result. You shouldn't erase any of the state information. You are allowed to adopt any form of representation as long as it gives all the state information, e.g. entries can be shown in the form of a list separated by commas so that previous state information can be traced back.
Ans: Using Tomasulo's Algorithm, the instruction status, reservation stations, and register status are given below.
Instruction status

    Instruction           Issue   Execute                  Write result
    L.D   F2, 32(R1)      yes     yes                      yes
    L.D   F6, 44(R2)      yes     yes (address computed)
    MUL.D F8, F6, F4      yes
    SUB.D F0, F6, F2      yes
    DIV.D F10, F8, F2     yes
    ADD.D F2, F0, F6      yes

Reservation stations

    Name    Busy   Op     Vj   Vk                    Qj      Qk      A
    Load1   no
    Load2   yes    LOAD                                              44 + Regs[R2]
    Add1    yes    SUB         Mem[32 + Regs[R1]]    Load2
    Add2    yes    ADD                               Add1    Load2
    Add3    no
    Mult1   yes    MUL         Regs[F4]              Load2
    Mult2   yes    DIV         Mem[32 + Regs[R1]]    Mult1

Register status

    Field   F0     F2     F4   F6      F8      F10     F12   ...   F30
    Qi      Add1   Add2        Load2   Mult1   Mult2

In the first table, all of the instructions have issued, but only the first load instruction has completed and written its result to the CDB. The second load has completed effective address calculation but is waiting on the memory unit. We use the array Regs[ ] to refer to the register file and the array Mem[ ] to refer to the memory. An operand is specified by either a Q field or a V field at any time. The ADD.D instruction, which has a WAR hazard at the WB stage, has issued and could complete before the DIV.D initiates.
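The Q fields and the register status row can be reproduced with a simplified model of the issue step and one CDB broadcast. This is a sketch under stated assumptions, not a full Tomasulo simulator: it tracks only which station will produce each operand (no values, no execution timing), it treats the integer address registers as always ready, and the names `issue_all` and `broadcast` are hypothetical.

```python
# Station names follow the reservation station table above.
STATIONS = {
    "load": ["Load1", "Load2"],
    "add": ["Add1", "Add2", "Add3"],
    "mult": ["Mult1", "Mult2"],
}
GROUP = {"L.D": "load", "ADD.D": "add", "SUB.D": "add",
         "MUL.D": "mult", "DIV.D": "mult"}

def issue_all(program):
    """Issue every instruction in order, renaming sources to station tags."""
    qi = {}  # FP register -> station that will write it
    rs = {}  # busy station -> {"op": ..., "q": [tag or None per source]}
    for op, dest, sources in program:
        # Pick the first free station of the right kind.
        name = next(s for s in STATIONS[GROUP[op]] if s not in rs)
        # None means the value is already available (a V field).
        rs[name] = {"op": op, "q": [qi.get(reg) for reg in sources]}
        qi[dest] = name
    return qi, rs

def broadcast(station, qi, rs):
    """Model one CDB write: waiting operands and registers capture the result."""
    for entry in rs.values():
        entry["q"] = [None if tag == station else tag for tag in entry["q"]]
    for reg in [r for r, tag in qi.items() if tag == station]:
        del qi[reg]  # register now holds a value, no pending producer
    del rs[station]  # station becomes free

# FP sources only; R1 and R2 are assumed ready.
PROGRAM = [
    ("L.D", "F2", []),
    ("L.D", "F6", []),
    ("MUL.D", "F8", ["F6", "F4"]),
    ("SUB.D", "F0", ["F6", "F2"]),
    ("DIV.D", "F10", ["F8", "F2"]),
    ("ADD.D", "F2", ["F0", "F6"]),
]

qi, rs = issue_all(PROGRAM)
broadcast("Load1", qi, rs)  # first load writes its result on the CDB

print("Register status:", qi)
for name, entry in rs.items():
    print(name, entry["op"], entry["q"])
```

Because ADD.D has already renamed F2 to Add2 by the time Load1 broadcasts, the register status for F2 keeps pointing at Add2, while SUB.D and DIV.D capture the loaded value. This is how renaming removes the WAR hazard mentioned above.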
