QUESTION 9.1
The suboperations performed in each segment of the pipeline are as follows:

R1 <-- Ai, R2 <-- Bi          Input Ai and Bi
R3 <-- Ci, R4 <-- Di          Input Ci and Di
R5 <-- R1+R2, R6 <-- R3+R4    Add the inputs
R7 <-- R5*R6                  Multiply

Each segment contains one or more registers and a combinational circuit, as shown in the configuration below:

[Fig.1 Pipeline Configuration for (Ai+Bi)*(Ci+Di): inputs Ai, Bi, Ci, Di; output register R7]

The following table shows the content of all the registers for i = 1 through 6:

Clock   Segment 1             Segment 2        Segment 3
Pulse   R1   R2   R3   R4     R5     R6        R7
1       A1   B1   C1   D1     -      -         -
2       A2   B2   C2   D2     A1+B1  C1+D1     -
3       A3   B3   C3   D3     A2+B2  C2+D2     (A1+B1)*(C1+D1)
4       A4   B4   C4   D4     A3+B3  C3+D3     (A2+B2)*(C2+D2)
5       A5   B5   C5   D5     A4+B4  C4+D4     (A3+B3)*(C3+D3)
6       A6   B6   C6   D6     A5+B5  C5+D5     (A4+B4)*(C4+D4)
7       -    -    -    -      A6+B6  C6+D6     (A5+B5)*(C5+D5)
8       -    -    -    -      -      -         (A6+B6)*(C6+D6)
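The register table can be reproduced by stepping the three segments one clock pulse at a time. The following is a minimal Python sketch under stated assumptions: the function name `simulate_pipeline` and the use of symbolic strings instead of numeric operands are illustrative choices, not part of the original solution.

```python
def simulate_pipeline(n_items=6):
    """Return (pulse, registers) rows for the 3-segment (Ai+Bi)*(Ci+Di) pipeline."""
    empty = "-"
    regs = {name: empty for name in ("R1", "R2", "R3", "R4", "R5", "R6", "R7")}
    rows = []
    # A 3-segment pipeline needs k + n - 1 = n + 2 pulses to drain completely.
    for pulse in range(1, n_items + 3):
        # All registers load on the same clock edge, so the next state is
        # computed only from the current state.
        nxt = {}
        nxt["R7"] = f"({regs['R5']})*({regs['R6']})" if regs["R5"] != empty else empty
        nxt["R5"] = f"{regs['R1']}+{regs['R2']}" if regs["R1"] != empty else empty
        nxt["R6"] = f"{regs['R3']}+{regs['R4']}" if regs["R3"] != empty else empty
        if pulse <= n_items:
            nxt.update(R1=f"A{pulse}", R2=f"B{pulse}", R3=f"C{pulse}", R4=f"D{pulse}")
        else:
            nxt.update(R1=empty, R2=empty, R3=empty, R4=empty)
        regs = nxt
        rows.append((pulse, dict(regs)))
    return rows


if __name__ == "__main__":
    for pulse, regs in simulate_pipeline():
        print(pulse, *(regs[name] for name in ("R1", "R2", "R3", "R4", "R5", "R6", "R7")))
```

Computing the next state from the current one mirrors the fact that every register in the pipeline is clocked simultaneously; updating the registers in place would let a value skip a segment.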
QUESTION 9.2
Firstly, we determine the number of clock cycles.
Number of segments, k = 6
Number of tasks, n = 8
Number of clock cycles = k + (n - 1) = 6 + (8 - 1) = 13 clock cycles

The space-time diagram for a 6-segment pipeline is shown below:

Segment   Clock cycle
          1   2   3   4   5   6   7   8   9   10  11  12  13
1         T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -   -
2         -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -
3         -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -
4         -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -
5         -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -
6         -   -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8

It takes 13 clock cycles to process 8 tasks in a 6-segment pipeline.

QUESTION 9.3
Number of segments, k = 6
Number of tasks, n = 200
Number of clock cycles = k + (n - 1)
Therefore, 6 + (200 - 1) = 205 clock cycles

QUESTION 9.4
tn = 50ns, tp = 10ns, number of segments k = 6
i. Number of tasks, n = 100.
$S = \frac{n t_n}{(k + n - 1) t_p} = \frac{100 \times 50}{(6 + 100 - 1) \times 10} = \frac{5000}{1050} = \frac{100}{21} = 4.76$
Hence, the speedup ratio = 100/21 = 4.76
ii. For the maximum speedup, we assume tn = k tp, i.e.
$S = \frac{t_n}{t_p} = \frac{k t_p}{t_p} = k$, so $S = \frac{6 \times 10}{10} = 6$
Hence, the maximum speedup that can be achieved is 6. This bound relies on the assumption tn = k tp = 60ns; with the given tn = 50ns, the speedup approaches tn/tp = 50/10 = 5 as n grows.
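The cycle counts, the space-time diagram and the speedup ratio all follow from the two expressions k + (n - 1) and S = n tn / ((k + n - 1) tp). The Python sketch below is one way to check the arithmetic; the function names are illustrative and not taken from the source.

```python
from fractions import Fraction


def pipeline_cycles(k, n):
    """Clock cycles needed for n tasks in a k-segment pipeline."""
    return k + (n - 1)


def speedup(n, t_n, k, t_p):
    """Exact speedup ratio S = n*t_n / ((k + n - 1)*t_p)."""
    return Fraction(n * t_n, pipeline_cycles(k, n) * t_p)


def space_time(k, n):
    """grid[s][c] is the task number in segment s+1 during cycle c+1, or None if idle."""
    cycles = pipeline_cycles(k, n)
    return [
        [(c - s + 1) if 0 <= c - s < n else None for c in range(cycles)]
        for s in range(k)
    ]


if __name__ == "__main__":
    print(pipeline_cycles(6, 8))      # cycles for 8 tasks, 6 segments
    print(pipeline_cycles(6, 200))    # cycles for 200 tasks, 6 segments
    s = speedup(100, 50, 6, 10)
    print(s, round(float(s), 2))
    for row in space_time(6, 8):
        print(" ".join(f"T{t}" if t else "-" for t in row))
```

Using `Fraction` keeps the intermediate value 100/21 exact, so the rounding to two decimals happens only once, at the end.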
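QUESTION 9.5
[Figure: pipeline with inputs Ai, Bi, Ci; registers R1, R2, R3, R4; Multiplier; Adder. Labelled delays: 40ns, 45ns]

Number of segments, k = 3
a. Minimum clock cycle time: tp = (45+5)ns = 50ns
b. For the non-pipeline system, we have tn = (40+45+15)ns = 100ns.
   Number of tasks, n = 7: n tn = 7 × 100 = 700ns
c. Speedup for 10 tasks: n = 10, tn = 100ns, tp = 50ns
   $S = \frac{n t_n}{(k + n - 1) t_p} = \frac{10 \times 100}{(3 + 10 - 1) \times 50} = \frac{1000}{12 \times 50} = \frac{1000}{600} = 1.67$
d. Speedup of the pipeline for 100 tasks:
   $S = \frac{n t_n}{(k + n - 1) t_p} = \frac{100 \times 100}{(3 + 100 - 1) \times 50} = \frac{10000}{102 \times 50} = \frac{10000}{5100} = 1.96$
e. Max. speedup: if tn = k tp, then $S = \frac{k t_p}{t_p} = k$, i.e. S = 3. Hence, the maximum speedup that can be achieved is the number of segments in the pipeline. With the values above, however, tn = 100ns is only 2 tp, so the speedup approaches tn/tp = 100/50 = 2 as n grows.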
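The figures in Question 9.5 can be re-derived from the individual delays. The sketch below is a hedged check rather than the original working: the variable names are illustrative, and the 5ns term is assumed to be the register transfer delay that follows the multiplier.

```python
# Delays in nanoseconds, as used in the solution above.
fetch_ns = 40      # operands read into the input registers
multiply_ns = 45   # propagation through the multiplier
transfer_ns = 5    # assumed: register transfer after the multiplier
add_ns = 15        # propagation through the adder
k = 3              # number of segments

# The clock must accommodate the slowest segment.
t_p = max(fetch_ns, multiply_ns + transfer_ns, add_ns)
# Without the intermediate registers, one task passes through every stage in turn.
t_n = fetch_ns + multiply_ns + add_ns


def speedup(n):
    """S = n*t_n / ((k + n - 1)*t_p) for n tasks."""
    return n * t_n / ((k + n - 1) * t_p)


if __name__ == "__main__":
    print(t_p, t_n)
    print(round(speedup(10), 2), round(speedup(100), 2))
    print(t_n / t_p)  # limit of the speedup as n grows
```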
Question 9.7
The time delay of the four segments in the pipeline in figure 3 are as follows: t1 = 50ns, t2 = 30ns, t3 = 95ns, t4 = 45ns. The interface registers delay time tr = 5ns.
a. How long would it take to add 100 pairs of numbers in the pipeline?
b. How can we reduce the total time to about one half of the time calculated in part (a)?

Solution
Time delays for each of the four segments are: t1 = 50ns, t2 = 30ns, t3 = 95ns, t4 = 45ns.
Interface register delay time tr = 5ns
Maximum time delay = t3
Clock cycle tp = t3 + tr = 95 + 5 = 100ns
Hence, the minimum clock cycle for each task is 100ns.
For 100 tasks (100 pairs of numbers) in the k = 4 segment pipeline we have:
(k + n - 1) tp = (4 + 100 - 1) × 100ns = 10300ns
Once the pipeline is full, one result is produced every 100ns, so the 100 results alone account for (100 * 100)ns = 10000ns; the remaining 300ns is the time needed to fill the pipeline.

Question 9.8
How would you use the floating-point pipeline adder of fig 9.6 to add 100 floating-point numbers X1 + X2 + X3 + ... + X100?

Solution
Let the floating point numbers X1, X2, X3, ..., X100 be represented in the form below:
$X_1 = A \times 2^a$
$X_2 = B \times 2^b$
$X_3 = C \times 2^c$
...
$X_{100} = M \times 2^m$
where A, B, C, ..., M are the mantissas and a, b, c, ..., m are the exponents, with the assumption that the floating point numbers are binary numbers.
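The mantissa and exponent form used in Question 9.8 can be seen directly on binary floating-point values. The sketch below uses Python's standard `math.frexp`, which normalizes the mantissa so that 0.5 <= |M| < 1; that normalization is Python's convention and is an assumption here, since the solution does not state one. The helper name `mantissa_exponent` is illustrative.

```python
import math


def mantissa_exponent(x):
    """Split x into (M, e) such that x == M * 2**e, with 0.5 <= |M| < 1 for x != 0."""
    return math.frexp(x)


if __name__ == "__main__":
    for x in (6.0, 0.15625, 100.0):
        m, e = mantissa_exponent(x)
        # math.ldexp(m, e) recombines the two parts into the original value.
        print(f"{x} = {m} x 2^{e}; recombined: {math.ldexp(m, e)}")
```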
Question 9.10 Pipeline Processing
Four possible hardware schemes that can be used in an instruction pipeline in order to minimize the performance degradation caused by instruction branching

What is Branch Instruction?
A branch (or jump on some computer architectures such as the PDP-8 and Intel x86) is a point in a computer program where the flow of control is altered. The term branch is usually used when referring to a program written in machine code or assembly language; in a high-level programming language, branches usually take the form of conditional statements, subroutine calls or GOTO statements.

An instruction that causes a branch, a branch instruction, can be taken or not taken: if a branch is not taken, the flow of control is unchanged and the next instruction to be executed is the instruction immediately following the current instruction in memory; if taken, the next instruction to be executed is an instruction at some other place in memory. A taken branch also alters the sequential program flow by loading the program counter with the target address.

There are usually two forms of branch instruction: a conditional branch that can be either taken or not taken, depending on a condition such as a CPU flag, and an unconditional branch which is always taken.

Pipelined computers employ various hardware techniques to minimize the performance degradation caused by instruction branching.

PREFETCH TARGET INSTRUCTION
One way of handling a conditional branch is to pre-fetch the target instruction in addition to the instruction following the branch. Both are saved until the branch is executed. If the branch condition is successful, the pipeline continues from the branch target instruction. An extension of this procedure is to continue fetching instructions from both places until the branch decision is made. At that time control chooses the instruction stream of the correct program flow.

BRANCH TARGET BUFFER (BTB)
The branch target buffer is an associative memory included in the fetch segment of the pipeline. Each entry in the branch target buffer consists of the address of a previously executed branch instruction and the target instruction for that branch. It also stores the next few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it searches the associative memory branch target buffer for the address of the instruction. If it is in the branch target buffer, the instruction is available directly and pre-fetch continues from the new path. If the instruction is not in the branch target buffer, the pipeline shifts to a new instruction stream and stores the target instruction in the branch target buffer.
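The lookup behaviour of a branch target buffer can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name, the fixed capacity, the oldest-first replacement and the list used as instruction memory are all hypothetical, and a real BTB is an associative hardware memory, not a dictionary.

```python
class BranchTargetBuffer:
    """Toy model: maps a branch instruction address to its stored target instructions."""

    def __init__(self, capacity=4):
        self.capacity = capacity   # assumed size, not from the source
        self.entries = {}          # branch address -> list of target instructions

    def lookup(self, branch_addr, memory, target_addr, extra=2):
        """Return (hit, instructions) for a decoded branch at branch_addr."""
        if branch_addr in self.entries:
            # Hit: the target instruction and the few that follow are available directly.
            return True, self.entries[branch_addr]
        # Miss: fetch the new instruction stream from memory and record it.
        instructions = memory[target_addr:target_addr + extra + 1]
        if len(self.entries) >= self.capacity:
            # Assumed policy: discard the oldest entry (dicts keep insertion order).
            self.entries.pop(next(iter(self.entries)))
        self.entries[branch_addr] = instructions
        return False, instructions


if __name__ == "__main__":
    memory = [f"I{i}" for i in range(16)]   # stand-in for instruction memory
    btb = BranchTargetBuffer()
    print(btb.lookup(3, memory, 10))        # first encounter of the branch: miss
    print(btb.lookup(3, memory, 10))        # same branch again: hit
```

The second call returns the stored instructions without touching `memory`, which is the software analogue of the pipeline continuing from the new path without a fetch delay.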
The advantage of the branch target buffer scheme is that branch instructions that have occurred previously are readily available in the pipeline without interruption.

LOOP BUFFER
A variation of the branch target buffer is the loop buffer. This is a very high speed register file maintained by the instruction fetch segment of the pipeline. When a program loop is detected in the program, it is stored in the loop buffer in its entirety, including all branches. The program can be executed directly without having to access memory until the loop mode is removed by the final branching out.

BRANCH PREDICTION
A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branching instruction before it is executed. The pipeline then begins pre-fetching the instruction stream from the predicted path. A correct prediction eliminates the wasted time caused by branch penalties.

Question 9.16
Consider the multiplication of two 40*40 matrices using a vector processor.
a. How many product terms are there in each inner product and how many inner products must be evaluated?
b. How many multiply-add operations are needed to calculate the product matrix?

Solutions
a. There are 40 product terms in each inner product.
   The product terms in each inner product = 40
   Hence the number of inner products = 40*40 = 1600
b. Multiply-add operations = inner products * 40 = 1600 * 40 = 64000
   There are 64000 multiply-add operations needed to calculate the product matrix.

Question 9.19
Flops is the number of floating point operations performed per second by a computer system. Megaflops is the number of millions of floating point operations performed per second, and Gigaflops is the number of billions of floating point operations performed per second.
A typical super computer has a basic cycle time of 4 to 20ns. If the processor of this super computer can calculate one floating point operation through a pipeline each cycle time, it will have the ability to perform 50 to 250 megaflops.
The Supercomputer can perform 100 million floating point operations per second, i.e. 100 megaflops.
The number of operations is 250 billion floating point operations, i.e. 250 × 10^9 operations.
Hence, the time that it will take this computer to carry out the operations is:
$\frac{250 \times 10^9}{100 \times 10^6} = \frac{1000 \times 250}{100} = 2500$ seconds

QUESTION 9.20
To perform 400 floating-point operations using four processors with a cycle time of 40ns in each:
i) The total time required = (no of operations / no of processors) × processor cycle time = (400/4) × 40ns = 4000ns
ii) When using a single processor with a clock cycle of 10ns to perform the same task, the total time required = (400/1) × 10ns = 4000ns
Therefore, there is no difference in the time taken to perform the jobs between the two cases.
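The counts and times in Questions 9.16, 9.19 and 9.20 are simple products and quotients, so they are easy to verify mechanically. The Python sketch below is illustrative only; the function names are not from the source, and the run time in Question 9.19 is taken to be in seconds because the rate is given in operations per second.

```python
def multiply_add_count(size):
    """Multiply-add operations for a size x size matrix product:
    size*size inner products, each with size product terms."""
    inner_products = size * size
    return inner_products * size


def run_time_seconds(operations, flops):
    """Time to perform a number of operations at a given rate (operations per second)."""
    return operations / flops


def parallel_time_ns(operations, processors, cycle_ns):
    """Time when the operations are shared equally, one operation per cycle per processor."""
    return (operations / processors) * cycle_ns


if __name__ == "__main__":
    print(multiply_add_count(40))                       # Question 9.16
    print(run_time_seconds(250 * 10**9, 100 * 10**6))   # Question 9.19, in seconds
    print(parallel_time_ns(400, 4, 40))                 # Question 9.20, four processors
    print(parallel_time_ns(400, 1, 10))                 # Question 9.20, one processor
```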