MC9211 Computer Organization

Unit 4 Lesson 1 Processor Design

Basic Processing Unit

Connection Between the Processor and the Memory
[Figure: the memory connects to the processor through MAR and MDR. Inside the processor: control, PC, IR, ALU, and n general-purpose registers R0 to Rn-1.]

Figure 1.2. Connections between the processor and the memory.

Add LOCA, R0
• Transfer the contents of register PC to register MAR
• Issue a Read command to memory, and then wait until it has transferred the requested word into register MDR
• Transfer the instruction from MDR into IR and decode it
• Transfer the address LOCA from IR to MAR
• Issue a Read command and wait until MDR is loaded
• Transfer contents of MDR to the ALU
• Transfer contents of R0 to the ALU
• Perform addition of the two operands in the ALU and transfer result into R0
• Transfer contents of PC to ALU
• Add 4 to operand in ALU and transfer incremented address to PC

Overview
• Instruction Set Processor (ISP)
• Central Processing Unit (CPU)
• A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program.
• An instruction is executed by carrying out a sequence of more rudimentary operations.

Some Fundamental Concepts

Fundamental Concepts
• The processor fetches one instruction at a time and performs the operation specified.
• The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
• Instruction Register (IR)
• Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.

Executing an Instruction
• Fetch the contents of the memory location pointed to by the PC. The contents of this location are loaded into the IR (fetch phase).
  IR ← [[PC]]
• Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase).
  PC ← [PC] + 4
• Carry out the actions specified by the instruction in the IR (execution phase).
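The fetch and execution phases can be sketched as a small loop. This is only an illustrative model: the dictionary used as memory, the instruction strings, and the HALT marker are assumptions made for the example, not part of the slides.

```python
def run(memory, pc=0):
    """Fetch instructions until a hypothetical HALT word is reached."""
    trace = []
    while True:
        ir = memory[pc]   # IR <- [[PC]]   (fetch phase)
        pc = pc + 4       # PC <- [PC] + 4 (byte-addressable memory, 4-byte words)
        if ir == "HALT":  # HALT is an assumed marker used only to stop the loop
            break
        trace.append(ir)  # execution phase: here we only record the instruction
    return trace, pc


# Hypothetical program stored at word-aligned byte addresses.
program = {0: "Add LOCA, R0", 4: "Move (R1), R2", 8: "HALT"}
trace, final_pc = run(program)
print(trace, final_pc)
```

Because the PC is incremented right after the fetch, it already points at the next instruction while the current one executes.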

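Processor Organization
Figure 7.1. Single-bus organization of the datapath inside a processor (textbook page 413). The internal processor bus connects PC, MAR (address lines of the memory bus), MDR (data lines of the memory bus), IR with the instruction decoder and control logic, registers Y, Z, and TEMP, the general-purpose registers R0 to R(n-1), and the ALU. A MUX selects either Y or the constant 4 as ALU input A; the ALU control lines select operations such as Add, Sub, and XOR, plus Carry-in.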

Executing an Instruction
• Transfer a word of data from one processor register to another or to the ALU.
• Perform an arithmetic or a logic operation and store the result in a processor register.
• Fetch the contents of a given memory location and load them into a processor register.
• Store a word of data from a processor register into a given memory location.

Register Transfers
Figure 7.2. Input and output gating for the registers in Figure 7.1 (textbook page 416). Each register Ri has an input gate Riin and an output gate Riout on the internal processor bus; Y is loaded with Yin, and Z is loaded with Zin and read with Zout.

Register Transfers
• All operations and data transfers are controlled by the processor clock.
Figure 7.3. Input and output gating for one register bit. (Signals shown: Bus, Riin, Riout, Clock; a flip-flop with input D and output Q.)

Performing an Arithmetic or Logic Operation
• The ALU is a combinational circuit that has no internal storage.
• The ALU gets the two operands from the MUX and the bus.
• The result is temporarily stored in register Z.
• What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
  1. R1out, Yin
  2. R2out, SelectY, Add, Zin
  3. Zout, R3in
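The three-step sequence for R3 ← [R1] + [R2] can be traced with a toy model of the single bus. This is a sketch under simplifying assumptions: only the Add operation is modelled, signal names are parsed by their "in"/"out" suffix, and the register values are made up.

```python
def run_steps(regs, steps):
    """Apply control steps to a toy single-bus datapath with registers Y and Z."""
    for signals in steps:
        bus = None
        alu = None
        # At most one register drives the bus per step (signal names ending in 'out').
        for s in signals:
            if s.endswith("out"):
                bus = regs[s[:-3]]
        if "Add" in signals:
            # Input A comes from Y (SelectY) or the constant 4 (Select4); input B is the bus.
            a = regs["Y"] if "SelectY" in signals else 4
            alu = a + bus
        # Registers whose 'in' signal is active load at the end of the step.
        for s in signals:
            if s.endswith("in"):
                name = s[:-2]
                regs[name] = alu if name == "Z" else bus
    return regs


steps = [
    ["R1out", "Yin"],
    ["R2out", "SelectY", "Add", "Zin"],
    ["Zout", "R3in"],
]
regs = run_steps({"R1": 5, "R2": 7, "R3": 0, "Y": 0, "Z": 0}, steps)
print(regs["R3"])
```

Three steps are needed because the single bus can carry only one operand at a time, so R1 must be parked in Y before R2 is placed on the bus.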

Fetching a Word from Memory
• Address into MAR; issue Read operation; data into MDR.
Figure 7.4. Connection and control signals for register MDR. MDR connects to the memory-bus data lines (MDRinE, MDRoutE) and to the internal processor bus (MDRin, MDRout).

Fetching a Word from Memory
• The response time of each memory access varies (cache miss, memory-mapped I/O, …).
• To accommodate this, the processor waits until it receives an indication that the requested operation has been completed (Memory-Function-Completed, MFC).
• Move (R1), R2
  MAR ← [R1]
  Start a Read operation on the memory bus
  Wait for the MFC response from the memory
  Load MDR from the memory bus
  R2 ← [MDR]

Timing
Assume MAR is always available on the address lines of the memory bus.
• Move (R1), R2
  1. R1out, MARin, Read
  2. MDRinE, WMFC
  3. MDRout, R2in
Figure 7.5. Timing of a memory Read operation. (Signals shown over steps 1 to 3: Clock, MARin, Address, Read, MR, MDRinE, Data, MFC, MDRout.)

Execution of a Complete Instruction
• Add (R3), R1
• Fetch the instruction
• Fetch the first operand (the contents of the memory location pointed to by R3)
• Perform the addition
• Load the result into R1

Architecture
Figure 7.2. Input and output gating for the registers in Figure 7.1.

Execution of a Complete Instruction
• Add (R3), R1
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3), R1.

Execution of a Complete Instruction
• Add R2, R1
[Figure 7.6 (control sequence for Add (R3), R1) and Figure 7.1 are shown again, with R2out marked for the register-operand case.]

Execution of Branch Instructions
• A branch instruction replaces the contents of PC with the branch target address, which is usually obtained by adding an offset X given in the branch instruction.
• The offset X is usually the difference between the branch target address and the address immediately following the branch instruction.
• Conditional branch

Execution of Branch Instructions
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     Offset-field-of-IRout, Add, Zin
5     Zout, PCin, End
Figure 7.7. Control sequence for an unconditional branch instruction.

Multiple-Bus Organization (textbook page 424)
• Allow the contents of two different registers to be accessed simultaneously and have their contents placed on buses A and B.
• Allow the data on bus C to be loaded into a third register during the same clock cycle.
• Incrementer unit.
• The ALU can simply pass one of its two input operands unmodified to bus C (control signal: R=A or R=B).
Figure 7.8. Three-bus organization of the datapath. (Components: buses A, B, and C; register file; PC with incrementer; constant 4 and MUX; ALU with inputs A and B and output R; instruction decoder; IR; MDR on the memory-bus data lines; MAR on the address lines.)

Multiple-Bus Organization
• Add R4, R5, R6
Step  Action
1     PCout, R=B, MARin, Read, IncPC
2     WMFC
3     MDRoutB, R=B, IRin
4     R4outA, R5outB, SelectA, Add, R6in, End
Figure 7.9. Control sequence for the instruction Add R4, R5, R6 for the three-bus organization in Figure 7.8.

Exercise
• What is the control sequence for execution of the instruction Add R1, R2, including the instruction fetch phase? (Assume the single-bus architecture of Figure 7.1.)

Hardwired Control

Overview
• To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence.
• Two categories: hardwired control and microprogrammed control
• A hardwired system can operate at high speed, but with little flexibility.

Control Unit Organization
[Block diagram: the clock (CLK) drives the control step counter; the decoder/encoder combines the step count with IR, external inputs, and condition codes to produce the control signals.]

Figure 7.10. Control unit organization.

Detailed Block Description
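[Block diagram: the clock (CLK) drives the control step counter (with Reset); the step decoder produces T1, T2, …, Tn; the instruction decoder takes IR and produces INS1, INS2, …, INSm; the encoder combines these with external inputs and condition codes to produce the control signals, including Run and End.]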

Figure 7.11. Separation of the decoding and encoding function

Generating Zin
• Zin = T1 + T6 • ADD + T4 • BR + …
[Logic diagram: AND gates for T4 • Branch and T6 • Add feed an OR gate together with T1 to produce Zin.]

Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.

Generating End
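• End = T7 • ADD + T5 • BR + (T5 • N + T4 • N̄) • BRN + …
[Logic diagram: AND gates for T7 • Add, T5 • Branch, and the two Branch<0 terms (T5 with N, T4 with N̄) feed an OR gate that produces End.]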

Figure 7.13. Generation of the End control signal.
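The two logic expressions can be written as Boolean functions of the time step and the decoded instruction. This is a sketch covering only the terms listed on the slides (the "…" terms are omitted), and it assumes the T4 term of Branch<0 uses the complement of N, so that a branch that is not taken ends at step 4.

```python
def z_in(t, instr):
    """Zin = T1 + T6.ADD + T4.BR (listed terms only)."""
    return t == 1 or (t == 6 and instr == "ADD") or (t == 4 and instr == "BR")


def end(t, instr, n=False):
    """End = T7.ADD + T5.BR + (T5.N + T4.NOT N).BRN (listed terms only)."""
    return (
        (t == 7 and instr == "ADD")
        or (t == 5 and instr == "BR")
        or (instr == "BRN" and ((t == 5 and n) or (t == 4 and not n)))
    )


# Steps at which Zin is asserted while executing ADD.
add_zin_steps = [t for t in range(1, 8) if z_in(t, "ADD")]
print(add_zin_steps)
```

Each product term corresponds to one AND gate in Figures 7.12 and 7.13, and the sum corresponds to the final OR gate.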

A Complete Processor
Figure 7.14. Block diagram of a complete processor. (Processor: instruction unit, integer unit, floating-point unit, instruction cache, data cache, bus interface. The system bus connects the processor to main memory and input/output.)

Microprogrammed Control

Microprogrammed Control (textbook page 430)
• Control signals are generated by a program similar to machine language programs.
• Control Word (CW), microroutine, microinstruction

Microinstruction  PCin  PCout  MARin  Read  MDRout  IRin  Yin  Select  Add  Zin  Zout  R1out  R1in  R3out  WMFC  End
1                 0     1      1      1     0       0     0    1       1    1    0     0      0     0      0     0
2                 1     0      0      0     0       0     1    0       0    0    1     0      0     0      1     0
3                 0     0      0      0     1       1     0    0       0    0    0     0      0     0      0     0
4                 0     0      1      1     0       0     0    0       0    0    0     0      0     1      0     0
5                 0     0      0      0     0       0     1    0       0    0    0     1      0     0      1     0
6                 0     0      0      0     1       0     0    0       1    1    0     0      0     0      0     0
7                 0     0      0      0     0       0     0    0       0    0    1     0      1     0      0     1

Figure 7.15. An example of microinstructions for Figure 7.6.
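A control word with one bit per control signal can be built directly from the list of signals active in a step. The sketch below assumes the signal order of the Figure 7.15 column headings and treats Select = 1 as Select4; the function name is made up for the example.

```python
# Signal order assumed from the column headings of Figure 7.15.
SIGNALS = ["PCin", "PCout", "MARin", "Read", "MDRout", "IRin", "Yin", "Select",
           "Add", "Zin", "Zout", "R1out", "R1in", "R3out", "WMFC", "End"]


def control_word(active):
    """Return the control word as a bit string, one bit per signal."""
    return "".join("1" if s in active else "0" for s in SIGNALS)


# Steps 1 and 7 of the control sequence for Add (R3), R1.
cw1 = control_word({"PCout", "MARin", "Read", "Select", "Add", "Zin"})
cw7 = control_word({"Zout", "R1in", "End"})
print(cw1)
print(cw7)
```

Storing one such word per step in the control store gives the microroutine for the instruction.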

Overview (textbook page 421)
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3), R1.

Basic organization of a microprogrammed control unit
• Control store
• One function cannot be carried out by this simple organization.
Figure 7.16. Basic organization of a microprogrammed control unit. (IR feeds the starting address generator, which loads the clocked µPC; the µPC addresses the control store, which outputs the CW.)

Conditional branch
• The previous organization cannot handle the situation when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action.
• Use a conditional branch microinstruction.

Address  Microinstruction
0        PCout, MARin, Read, Select4, Add, Zin
1        Zout, PCin, Yin, WMFC
2        MDRout, IRin
3        Branch to starting address of appropriate microroutine
...
25       If N=0, then branch to microinstruction 0
26       Offset-field-of-IRout, SelectY, Add, Zin
27       Zout, PCin, End

Figure 7.17. Microroutine for the instruction Branch<0.

Microprogrammed Control
Figure 7.18. Organization of the control unit to allow conditional branching in the microprogram. (External inputs, condition codes, and IR feed the starting and branch address generator, which loads the clocked µPC; the µPC addresses the control store, which outputs the CW.)

Microinstructions
• A straightforward way to structure microinstructions is to assign one bit position to each control signal.
• However, this is very inefficient.
• The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
• All mutually exclusive signals are placed in the same group in binary coding.

Partial Format for the Microinstructions
Microinstruction: F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8

F1 (4 bits): 0000: No transfer, 0001: PCout, 0010: MDRout, 0011: Zout, 0100: R0out, 0101: R1out, 0110: R2out, 0111: R3out, 1010: TEMPout, 1011: Offsetout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: R0in, 101: R1in, 110: R2in, 111: R3in
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR (16 ALU functions)
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: Continue, 1: End

Figure 7.19. An example of a partial format for field-encoded microinstructions.
• What is the price paid for this scheme? It requires a little more hardware.
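Encoding one step with the field format of Figure 7.19 can be sketched as a table lookup per field. The dictionaries below copy the codes listed in the figure (F4 holds only the three ALU codes that are shown); the function name, keyword names, and the "none" key are assumptions made for the example.

```python
F1 = {"none": "0000", "PCout": "0001", "MDRout": "0010", "Zout": "0011",
      "R0out": "0100", "R1out": "0101", "R2out": "0110", "R3out": "0111",
      "TEMPout": "1010", "Offsetout": "1011"}
F2 = {"none": "000", "PCin": "001", "IRin": "010", "Zin": "011",
      "R0in": "100", "R1in": "101", "R2in": "110", "R3in": "111"}
F3 = {"none": "000", "MARin": "001", "MDRin": "010", "TEMPin": "011", "Yin": "100"}
F4 = {"Add": "0000", "Sub": "0001", "XOR": "1111"}  # only the codes shown in the figure
F5 = {"none": "00", "Read": "01", "Write": "10"}


def encode(f1="none", f2="none", f3="none", f4="Add", f5="none",
           select4=False, wmfc=False, end=False):
    """Return fields F1..F8 as bit strings separated by spaces."""
    fields = [F1[f1], F2[f2], F3[f3], F4[f4], F5[f5],
              "1" if select4 else "0",   # F6: 0 = SelectY, 1 = Select4
              "1" if wmfc else "0",      # F7: WMFC
              "1" if end else "0"]       # F8: End
    return " ".join(fields)


# Step 1 of the fetch: PCout, MARin, Read, Select4, Add, Zin
word = encode(f1="PCout", f2="Zin", f3="MARin", f4="Add", f5="Read", select4=True)
print(word)
```

Since each field holds a single code, two signals from the same group (for example PCout and Zout) cannot be active in the same microinstruction, which is why only mutually exclusive signals share a field.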

Further Improvement (textbook page 434)
• Enumerate the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can then be assigned a distinct code.
• Vertical organization
• Horizontal organization

Microprogram Sequencing
• If all microprograms require only straightforward sequential execution of microinstructions except for branches, letting a µPC govern the sequencing would be efficient.
• However, there are two disadvantages:
  - Having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store.
  - Longer execution time, because it takes more time to carry out the required branches.
• Example: Add src, Rdst
• Four addressing modes: register, autoincrement, autodecrement, and indexed (with indirect …)

Wide-Branch Addressing (textbook page 436)
• Bit-ORing
• WMFC

Contents of IR: OP code (bits 11 and up) | Mode = 0 1 0 (bits 10 to 8) | Rsrc (bits 7 to 4) | Rdst (bits 3 to 0)

Address (octal)  Microinstruction (textbook page 439)
000              PCout, MARin, Read, Select4, Add, Zin
001              Zout, PCin, Yin, WMFC
002              MDRout, IRin
003              µBranch {µPC ← 101 (from Instruction decoder); µPC5,4 ← [IR10,9]; µPC3 ← ¬[IR10] ⋅ ¬[IR9] ⋅ [IR8]}
121              Rsrcout, MARin, Read, Select4, Add, Zin
122              Zout, Rsrcin
123              µBranch {µPC ← 170; µPC0 ← ¬[IR8]}, WMFC
170              MDRout, MARin, Read, WMFC
171              MDRout, Yin
172              Rdstout, SelectY, Add, Zin
173              Zout, Rdstin, End

Figure 7.21. Microinstruction for Add (Rsrc)+, Rdst.
Note: Microinstruction at location 170 is not executed for this addressing mode.
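The bit-ORing in the two µBranch microinstructions can be checked numerically. This sketch assumes that the [IR10] and [IR9] terms for µPC3 and the [IR8] term for µPC0 are complemented, which is what makes location 170 be skipped for this addressing mode as the note states; the function names are made up for the example.

```python
def branch_at_003(ir10, ir9, ir8):
    """µPC <- 101 (octal), then OR in the mode bits of IR."""
    upc = 0o101
    upc |= (ir10 << 5) | (ir9 << 4)               # µPC5,4 <- [IR10,9]
    upc |= ((1 - ir10) & (1 - ir9) & ir8) << 3    # µPC3 <- NOT IR10 . NOT IR9 . IR8 (assumed)
    return upc


def branch_at_123(ir8):
    """µPC <- 170 (octal), with bit 0 set when IR8 = 0 (assumed complement)."""
    return 0o170 | (1 - ir8)


# Autoincrement mode Add (Rsrc)+, Rdst: mode bits IR10, IR9, IR8 = 0, 1, 0
first = branch_at_003(0, 1, 0)
second = branch_at_123(0)
print(oct(first), oct(second))
```

With mode bits 0 1 0 the first branch lands on 121 and the second on 171, matching the addresses in Figure 7.21.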

Microinstructions with Next-Address Field
• The microprogram we discussed requires several branch microinstructions, which perform no useful operation in the datapath.
• A powerful alternative approach is to include an address field as a part of every microinstruction to indicate the location of the next microinstruction to be fetched.
• Pros: separate branch microinstructions are virtually eliminated; few limitations in assigning addresses to microinstructions.
• Cons: additional bits for the address field (around 1/6)

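Microinstructions with Next-Address Field
Figure 7.22. Microinstruction-sequencing organization. (IR, external inputs, and condition codes feed the decoding circuits; together with the next-address field they load the µAR, which addresses the control store; the µIR holds the fetched microinstruction, and the microinstruction decoder produces the control signals.)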

Microinstruction: F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10

F0 (8 bits): Address of next microinstruction
F1 (3 bits): 000: No transfer, 001: PCout, 010: MDRout, 011: Zout, 100: Rsrcout, 101: Rdstout, 110: TEMPout
F2 (3 bits): 000: No transfer, 001: PCin, 010: IRin, 011: Zin, 100: Rsrcin, 101: Rdstin
F3 (3 bits): 000: No transfer, 001: MARin, 010: MDRin, 011: TEMPin, 100: Yin
F4 (4 bits): 0000: Add, 0001: Sub, …, 1111: XOR
F5 (2 bits): 00: No action, 01: Read, 10: Write
F6 (1 bit): 0: SelectY, 1: Select4
F7 (1 bit): 0: No action, 1: WMFC
F8 (1 bit): 0: NextAdrs, 1: InstDec
F9 (1 bit): 0: No action, 1: ORmode
F10 (1 bit): 0: No action, 1: ORindsrc

Figure 7.23. Format for microinstructions in the example of Section 7.

Implementation of the Microroutine
[Table: bit patterns of fields F0 to F10 for the microinstructions at octal addresses 000 to 003, 121 to 122, and 170 to 173.]
Figure 7.24. Implementation of the microroutine of Figure 7.21 using a next-microinstruction address field. (See Figure 7.23 for encoded signals.)

Figure 7.25. Some details of the control-signal-generating circuitry. (The Rsrc and Rdst fields of IR feed two decoders that produce R0in to R15in and R0out to R15out, gated by Rsrcin, Rsrcout, Rdstin, and Rdstout from fields F1 and F2; fields F8, F9, and F10 supply InstDecout, ORmode, and ORindsrc to the decoding circuits that load the µAR; the microinstruction decoder produces the other control signals.)

Bit-ORing

Pipelining

Overview
• Pipelining is widely used in modern processors.
• Pipelining improves system performance in terms of throughput.
• Pipelined organization requires sophisticated compilation techniques.

Basic Concepts

Making the Execution of Programs Faster
• Use faster circuit technology to build the processor and the main memory.
• Arrange the hardware so that more than one operation can be performed at the same time.
• In the latter way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

Traditional Pipeline Concept
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes (A, B, C, D) to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes

Traditional Pipeline Concept
[Timeline, 6 PM to midnight: loads A, B, C, D run one after another, each taking 30 + 40 + 20 minutes.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Traditional Pipeline Concept
[Timeline from 6 PM, task order A, B, C, D overlapped: 30, 40, 40, 40, 40, 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads

Traditional Pipeline Concept
• Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
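The laundry numbers can be checked with a few lines of arithmetic. This is a sketch that assumes every load takes the same time in each stage and that a load may wait between machines; the function names are made up for the example.

```python
def sequential_minutes(stages, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """First load passes through every stage; each later load adds one slowest-stage delay."""
    return sum(stages) + (loads - 1) * max(stages)


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq / 60, pipe / 60, seq / pipe)
```

The speedup here is about 1.7 rather than 3 because the stages are unbalanced and four loads are too few to hide the fill and drain time.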

Use the Idea of Pipelining in a Computer
• Fetch + Execution
Figure 8.1. Basic idea of instruction pipelining. (a) Sequential execution: F1 E1 F2 E2 F3 E3 for instructions I1, I2, I3. (b) Hardware organization: instruction fetch unit, interstage buffer B1, execution unit. (c) Pipelined execution: F1 E1, F2 E2, F3 E3 overlapped over clock cycles 1 to 4.

Use the Idea of Pipelining in a Computer
• Fetch + Decode + Execution + Write

Clock cycle  1   2   3   4   5   6   7
I1           F1  D1  E1  W1
I2               F2  D2  E2  W2
I3                   F3  D3  E3  W3
I4                       F4  D4  E4  W4
(a) Instruction execution divided into four steps

(b) Hardware organization: F: Fetch instruction, D: Decode instruction and fetch operands, E: Execute operation, W: Write results, separated by interstage buffers B1, B2, B3

Figure 8.2. A 4-stage pipeline. (Textbook page 457)

Role of Cache Memory
• Each pipeline stage is expected to complete in one clock cycle.
• The clock period should be long enough to let the slowest pipeline stage complete.
• Faster stages can only wait for the slowest one to complete.
• Since main memory is very slow compared to the execution, if each instruction needs to be fetched from main memory, the pipeline is almost useless.
• Fortunately, we have cache.

Pipeline Performance
• The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
• However, this increase would be achieved only if all pipeline stages require the same time to complete, and there is no interruption throughout program execution.
• Unfortunately, this is not true.

Pipeline Performance

Clock cycle  1   2   3   4   5   6   7   8   9
I1           F1  D1  E1  W1
I2               F2  D2  E2  E2  E2  W2
I3                   F3  D3          E3  W3
I4                       F4  D4          E4  W4
I5                           F5  D5          E5

Figure 8.3. Effect of an execution operation taking more than one clock cycle.
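The timing of Figure 8.3 can be reproduced with a small in-order model: an instruction enters a stage once it has left the previous stage and the preceding instruction has left this stage. This is a sketch; the function name and the duration table are assumptions made for the example, and buffering between stages is not modelled in detail.

```python
def write_cycles(durations):
    """durations[i][s] = clock cycles instruction i spends in stage s (F, D, E, W).

    Returns the clock cycle in which each instruction completes its last stage.
    """
    stage_busy_until = [0] * len(durations[0])
    done = []
    for instr in durations:
        t = 0  # last cycle used by this instruction so far
        for s, d in enumerate(instr):
            t = max(t, stage_busy_until[s]) + d
            stage_busy_until[s] = t
        done.append(t)
    return done


# Five instructions; the Execute step of I2 takes three cycles.
timing = [[1, 1, 1, 1], [1, 1, 3, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
result = write_cycles(timing)
print(result)
```

Without the long Execute step the instructions would complete in cycles 4 to 8, so everything behind I2 finishes two cycles late.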

Pipeline Performance
• The previous pipeline is said to have been stalled for two clock cycles.
• Any condition that causes a pipeline to stall is called a hazard.
• Data hazard – any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. So some operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions require the use of a given hardware resource at the same time.

Pipeline Performance
• Instruction hazard

(a) Instruction execution steps in successive clock cycles (I1, I2, I3 over clock cycles 1 to 9; F2 occupies cycles 2 to 5)

Clock cycle  1     2     3     4     5     6     7     8     9
F: Fetch     F1    F2    F2    F2    F2    F3
D: Decode          D1    idle  idle  idle  D2    D3
E: Execute               E1    idle  idle  idle  E2    E3
W: Write                       W1    idle  idle  idle  W2    W3
(b) Function performed by each processor stage in successive clock cycles. Idle periods are stalls (bubbles).

Figure 8.4. Pipeline stall caused by a cache miss in F2.

Pipeline Performance
• Structural hazard
• Load X(R1), R2

Clock cycle  1   2   3   4   5   6   7
I1           F1  D1  E1  W1
I2 (Load)        F2  D2  E2  M2  W2
I3                   F3  D3  E3      W3
I4                       F4  D4      E4
I5                           F5  D5

Figure 8.5. Effect of a Load instruction on pipeline timing.

Pipeline Performance
• Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases.
• Throughput is measured by the rate at which instruction execution is completed.
• Pipeline stall causes degradation in pipeline performance.
• We need to identify all hazards that may cause the pipeline to stall and to find ways to minimize their impact.

Quiz
• Four instructions; I2 takes two clock cycles for execution. Draw the timing figure for a 4-stage pipeline, and work out the total number of cycles needed for the four instructions to complete.

Data Hazards

Data Hazards
• We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
• Hazard occurs:
  A ← 3 + A
  B ← 4 × A
• No hazard:
  A ← 5 × C
  B ← 20 + C
• When two operations depend on each other, they must be executed sequentially in the correct order.
• Another example:
  Mul R2, R3, R4
  Add R5, R4, R6
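The dependency in the Mul/Add example can be found mechanically by comparing each instruction's source registers with the destinations of earlier instructions. This is a minimal sketch; the tuple layout (operation, sources, destination) and the function name are assumptions made for the example.

```python
def raw_dependencies(program):
    """Return (i, j, reg): instruction j reads reg, which earlier instruction i writes."""
    deps = []
    for j, (_, sources, _) in enumerate(program):
        for i in range(j):
            dest = program[i][2]
            if dest in sources:
                deps.append((i, j, dest))
    return deps


# Mul R2, R3, R4 writes R4; Add R5, R4, R6 reads R4.
dependent = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R4"), "R6")]
independent = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R7"), "R6")]
print(raw_dependencies(dependent))
print(raw_dependencies(independent))
```

An empty result means the two instructions can overlap freely in the pipeline; a non-empty one marks where a stall, forwarding, or reordering is needed.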

Data Hazards
Figure 8.6. Pipeline stalled by data dependency between D2 and W1. (I1 is the Mul and I2 the Add; instructions I1 to I4 are shown over clock cycles 1 to 9, with the decode step of I2 completed as D2A after W1.)

Operand Forwarding
• Instead of from the register file, the second instruction can get data directly from the output of the ALU after the previous instruction is completed.
• A special arrangement needs to be made to “forward” the output of the ALU to the input of the ALU.

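Figure 8.7. Operand forwarding in a pipelined processor. (a) Datapath: the register file supplies Source 1 and Source 2 to registers SRC1 and SRC2, which feed the ALU; the ALU result goes to RSLT and then to the Destination in the register file. (b) Position of the source and result registers in the processor pipeline: SRC1, SRC2, then E: Execute (ALU), then RSLT, then W: Write (register file), with a forwarding path from RSLT back to the ALU inputs.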

Handling Data Hazards in Software
• Let the compiler detect and handle the hazard:
  I1: Mul R2, R3, R4
      NOP
      NOP
  I2: Add R5, R4, R6
• The compiler can reorder the instructions to perform some useful work during the NOP slots.
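A compiler pass that pads dependent instructions with NOPs can be sketched as follows. The tuple layout (operation, sources, destination), the function name, and the default gap of two slots (taken from the two NOPs in the example) are assumptions; a real compiler would try to fill the slots with useful instructions instead.

```python
NOP = ("NOP", (), None)


def insert_nops(program, gap=2):
    """Insert NOPs so no instruction reads a register written in the previous `gap` slots."""
    out = []
    for instr in program:
        sources = instr[1]
        # Keep padding while one of the last `gap` emitted instructions writes a source.
        while gap > 0 and any(prev[2] in sources for prev in out[-gap:]):
            out.append(NOP)
        out.append(instr)
    return out


program = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R4"), "R6")]
scheduled = [op for op, _, _ in insert_nops(program)]
print(scheduled)
```

The output order matches the slide: Mul, two NOPs, then Add.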

Side Effects
• The previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a register other than the one named as the destination.
• When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. (Example?)
• Example: condition code flags:
  Add R1, R3
  AddWithCarry R2, R4
• Instructions designed for execution on pipelined hardware should have few side effects.

Instruction Hazards

Overview
• Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls.
• Cache miss
• Branch

Unconditional Branches
Figure 8.8. An idle cycle caused by a branch instruction. (I2 is the branch; I3 is fetched and then discarded (X), leaving the execution unit idle for one cycle before Ik and Ik+1 are fetched and executed.)

Branch Timing
• Branch penalty
• Reducing the penalty
Figure 8.9. Branch timing. (a) Branch address computed in Execute stage: I3 and I4 are fetched and discarded (X) before Ik is fetched. (b) Branch address computed in Decode stage: only I3 is discarded (X) before Ik is fetched.

Instruction Queue and Prefetching
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b. (F: Fetch instruction, with the instruction fetch unit filling the instruction queue; D: Dispatch/Decode unit; E: Execute instruction; W: Write results.)

Conditional Branches
• A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
• The decision to branch cannot be made until the execution of that instruction has been completed.
• Branch instructions represent about 20% of the dynamic instruction count of most programs.

Super Scalar Architecture
[Figure: F: instruction fetch unit, instruction queue, dispatch unit, floating-point unit and integer unit, W: write results.]

Super Scalar Operation
• Equip the processor with multiple processing units – several instructions can be executed in the same clock cycle – multiple-issue processor
• Throughput can be > 1 instruction / cycle
• The compiler should interleave floating-point and integer instructions
• Out-of-order execution must be taken care of
