The document discusses instruction-level parallelism and its exploitation through techniques like loop unrolling and scheduling. It provides examples of how unrolling and scheduling a sample loop can reduce the number of cycles needed per loop iteration. It also shows how unrolling a loop for a VLIW architecture that can issue multiple operations per cycle can eliminate stalls. However, overly unrolling loops can increase code size and unused functional units in the VLIW model can waste encoding bits.
Book 1 – Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 5th Edition, Morgan Kaufmann, 2012

Chapter Three: Instruction-Level Parallelism and Its Exploitation

Example of loop unrolling

Show our loop unrolled so that there are four copies of the loop body, assuming R1 – R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.

Example of loop unrolling
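As a rough high-level analogue of the transformation, the Python sketch below places a rolled loop next to a version with four copies of the body. It assumes that "our loop" is the x[i] = x[i] + s loop named in the VLIW example; the function names are hypothetical, and Python does not model registers, so the "do not reuse registers" requirement of the exercise is not represented here.

```python
def add_scalar_rolled(x, s):
    """Reference loop: one element per iteration, walking from the end down."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1  # one decrement and one loop test per element
    return x


def add_scalar_unrolled4(x, s):
    """Same computation with four copies of the loop body per iteration.

    Assumes the element count is a multiple of 4, mirroring the exercise's
    assumption that the iteration count is a multiple of 4.
    """
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4  # a single decrement and loop test now cover four elements
    return x
```

Per group of four elements, the unrolled version performs one index update and one loop test where the rolled version performs four of each.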
• Eliminated three branches and three decrements of R1
• This loop will run in 27 cycles (14 instruction issue cycles, 13 stall cycles) => 6.75 cycles per element
• The performance can be improved further if the unrolled loop is also scheduled

Example of loop unrolling and scheduling

Show the unrolled loop in the previous example after it has been scheduled.
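The cycle counts for the unrolled loop can be reproduced with a small in-order, single-issue stall model. The sketch below is an illustration under stated assumptions, not the slides' own listing: the stall table follows the textbook's usual FP pipeline assumptions (1 stall between a load and a dependent FP add, 2 between an FP add and a dependent store, 1 between an integer operation and a dependent branch), the FP register names and the instruction ordering are assumed, and `count_cycles` is a hypothetical helper. Memory addresses and offsets are not modeled.

```python
# Stall cycles between a producer and a dependent consumer.
# ASSUMPTION: these latencies are not stated in the text above; they follow
# the textbook's usual FP pipeline assumptions. Unlisted pairs cost 0 stalls.
STALLS = {
    ("load", "fp"): 1,      # load double -> FP ALU op
    ("fp", "store"): 2,     # FP ALU op -> store double
    ("int", "branch"): 1,   # integer op -> branch
}


def count_cycles(program):
    """Return the cycle in which the last instruction issues.

    Models an in-order pipeline that issues at most one instruction per cycle.
    Each instruction is a tuple (kind, destination or None, source registers).
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                issued, producer = last_writer[reg]
                stalls = STALLS.get((producer, kind), 0)
                earliest = max(earliest, issued + 1 + stalls)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


# Hypothetical register allocation: (load target, sum) pairs; F2 holds s.
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
loads = [("load", ld, ["R1"]) for ld, _ in PAIRS]
adds = [("fp", total, [ld, "F2"]) for ld, total in PAIRS]
stores = [("store", None, [total, "R1"]) for _, total in PAIRS]
decrement = ("int", "R1", ["R1"])
branch = ("branch", None, ["R1", "R2"])

# Unrolled only: load, add, store for each copy, then the loop overhead.
unrolled = []
for ld, add, st in zip(loads, adds, stores):
    unrolled += [ld, add, st]
unrolled += [decrement, branch]

# Unrolled and scheduled: independent instructions fill the stall slots.
scheduled = loads + adds + stores[:2] + [decrement] + stores[2:] + [branch]

print(count_cycles(unrolled), count_cycles(scheduled))
```

Under these assumptions the unscheduled order issues its last instruction in cycle 27 and the scheduled order in cycle 14. Grouping the loads, then the adds, then the stores gives each dependent instruction enough distance from its producer that no stall cycles remain.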
For unrolled and scheduled:
• Total cycles = 14
• 14/4 = 3.5 cycles/element

For unrolled only: 6.75 cycles/element

For scheduled only: 7 cycles/element

Example of basic VLIW model

Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i] = x[i] + s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore delayed branches.

• Total cycles: 9
• Issue rate: 23 operations in 9 clock cycles
• Efficiency (the percentage of available slots that contained an operation): 23 of 9 × 5 = 45 slots ≈ 51%

This VLIW code sequence requires at least 8 FP registers, while the same code sequence for the base MIPS processor can use as few as two FP registers.

Two technical problems with the VLIW model:
1. Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops, thereby increasing code size.
2. Whenever instructions are not full, the unused functional units translate to wasted bits in the instruction encoding.
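The issue rate and slot efficiency of the VLIW example can be checked with a few lines of arithmetic. This is a minimal sketch: the slot counts come from the issue widths given in the example (two memory, two FP, one integer or branch), and `vliw_stats` is a hypothetical helper name.

```python
# Slots per VLIW instruction, taken from the issue widths in the example.
MEMORY_SLOTS = 2
FP_SLOTS = 2
INT_OR_BRANCH_SLOTS = 1
slots_per_cycle = MEMORY_SLOTS + FP_SLOTS + INT_OR_BRANCH_SLOTS


def vliw_stats(operations, cycles, slots):
    """Return (operations per cycle, fraction of issue slots that were filled)."""
    ops_per_cycle = operations / cycles
    efficiency = operations / (cycles * slots)
    return ops_per_cycle, efficiency


# 23 operations issued over 9 cycles, as quoted in the example.
ops_per_cycle, efficiency = vliw_stats(23, 9, slots_per_cycle)
print(f"{ops_per_cycle:.2f} operations per cycle, {efficiency:.1%} of slots used")
```

Roughly half of the 45 available slots stay empty, which is the wasted encoding space described in the second problem above.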