EEL 4768: Computer Architecture: Instruction Level Parallelism (ILP)

4/8/2016
EEL 4768: Computer Architecture
Instruction‐Level Parallelism (ILP)
Department of Electrical and Computer Engineering
University of Central Florida
Instructor: Zakhia (Zak) Abichar
Source: “Computer Organization and Design”, Patterson and Hennessy, 4th Ed Revised, Section 4.10
• Definition: Instruction‐Level Parallelism (ILP) is a general concept that
implies multiple instructions execute in parallel
• There are two main ways to achieve ILP
Pipeline:
• In the 5‐stage pipeline that we’ve seen, there are up to five instructions
in the datapath
• Therefore, the pipelined datapath is a form of ILP
• The overlap in the pipelined datapath is called ‘partial overlap’ since
each instruction is using a different part of the pipeline (e.g., we can’t
have two instructions using the ALU in the same clock cycle since
there’s only one ALU)
Multiple‐issue datapath
• Another way to achieve ILP is to have multiple pipelines
• Therefore, multiple instructions can be issued (started) in a clock cycle
• In the figure below, the datapath has two pipelines, so it can issue two
new instructions in a clock cycle
• The datapath in the figure is called a ‘two‐way multiple issue CPU’
• The two pipelines contain at most 10 instructions at a time
• The two pipelines are synchronized to resolve data dependences
[Figure: two pipelines in the CPU]
CPI (Clocks Per Instruction) vs. IPC (Instructions Per Clock)
• A single‐issue pipelined datapath can finish one instruction per clock
cycle when there are no stalls
• Therefore, it can achieve: CPI = 1
• Note: the single‐cycle datapath can achieve a CPI=1, but its clock cycle is
much longer than the pipelined datapath’s
• A multiple‐issue pipelined datapath can finish multiple instructions per
clock cycle
• The two‐way multiple issue CPU can finish two instructions per cycle,
therefore, it can achieve: CPI = 0.5
• In this case, instead of measuring CPI (Clocks Per Instruction), we
measure Instructions Per Clock (IPC)
• Therefore, the two‐way multiple issue CPU can do: IPC = 2
Example
A 4‐way multiple‐issue CPU runs on a 3 GHz clock frequency. The pipeline
in this CPU has five stages. Find the CPI and IPC. How many instructions can
this CPU do per second? How many instructions can be in the datapath
simultaneously?
• Since the CPU is 4‐way multiple issue (it has 4 pipelines), it can finish 4
instructions per cycle. Therefore: IPC = 4
• This implies that: CPI = 1/4 = 0.25
• Clock rate is 3 GHz; there are 3 × 10^9 clock cycles per second; each cycle
finishes 4 instructions, therefore, the CPU can do 4 × 3 × 10^9 instructions,
which is 12 billion instructions per second
• The CPU has four pipelines and each one is 5‐stage; therefore, the CPU
can have 20 instructions in the datapath at any point in time
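The arithmetic in this example can be checked with a short sketch. The function name `peak_figures` and its parameter names are illustrative and do not come from the slides; the results are peak values that assume no stalls.

```python
def peak_figures(issue_width, clock_hz, stages):
    """Peak (no-stall) figures for a multiple-issue pipelined CPU."""
    ipc = issue_width                        # instructions finished per cycle
    cpi = 1 / issue_width                    # cycles per instruction
    instr_per_sec = issue_width * clock_hz   # peak instruction throughput
    in_flight = issue_width * stages         # instructions in the datapath
    return ipc, cpi, instr_per_sec, in_flight


# 4-way multiple issue, 3 GHz clock, five-stage pipelines
ipc, cpi, ips, in_flight = peak_figures(4, 3_000_000_000, 5)
print(ipc, cpi, ips, in_flight)  # 4 0.25 12000000000 20
```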
What assumption are we making in this example?
Multiple‐Issue CPU
• Most of today’s advanced CPUs are multiple‐issue
– Some basic CPUs used in embedded systems might not be multiple‐issue
– But CPUs used in desktops, servers, smartphones are multiple‐issue
• Today’s CPUs attempt to issue from 3 to 6 instructions per cycle
• However, due to data dependences, there is some limitation on which
instructions can execute in parallel
• Stalls also happen, so this might slow down the CPU
• In the example (previous slide), we assumed that the pipelines are always
full when we got the answers (IPC=4, 12 billion instructions per second,
and 20 instructions in the pipeline)
• Practically, due to the data dependences and hazards, the achieved
performance is lower than these peak numbers indicate
• The figure below is one way to represent a 4‐way multiple issue CPU
• In each clock cycle, up to 4 instructions can be issued
• Each box is called an ‘issue slot’
• In the first cycle, we only issued 3 instructions since we can’t find a 4th
one that doesn’t have a dependence with the other 3 instructions
• Therefore, not all the four issue slots can always be filled
• In one slot in the figure, there’s a stall and no instruction was issued at
all
[Figure: issue slots per clock cycle over time; the legend marks unused and
used slots]
• What are the tasks needed to support multiple‐issue CPUs?
• We should determine which instructions can execute in parallel
• Instructions with data dependences can’t run in parallel
– They could have partial overlap and we might possibly need to do forwarding
Who can do this task?
• One approach puts the compiler in charge
• The compiler groups the instructions during compilation, hence it’s the
static approach
• Another approach puts the hardware in charge
• The hardware groups the instructions at run‐time, hence it’s the dynamic
approach
Summarizing points
Compiler approach: static, also called VLIW (Very Long Instruction Word)
(+) Analyzes the code multiple times at compilation and packages the
instructions
(‐) Can’t see run‐time events, e.g., cache misses and exceptions
* Hardware is allowed to change the packages
(‐) Code is compiled for a specific hardware
Hardware approach: dynamic, also called superscalar
* Hardware looks at a window of ~100 instructions and runs all the
instructions that are ready
(+) Can see run‐time events and adjust instruction execution
(+) Code runs fast on all hardware since the hardware is in charge
* Usually the preferred approach
Speculation: executing code even though it’s not certain that this code must
be executed (it is done instead of idling)
• The compiler is a good candidate for grouping the instructions
• This is because the compiler can traverse the code and analyze it
multiple times during compilation
• The compiler can group instructions into issue slots, a process referred
to as packaging
• The compiler’s strength is the ability to see all the code at compile time
• However, the compiler can’t see events that happen at run‐time
• Events of interest include cache misses, exceptions and branch stalls, since
they may affect the packaging
• Accordingly, the CPU is allowed to alter the packaging that the compiler
made to adapt to run time events
• The compiler can easily see the hazard in (Code 1) and will create
enough separation between the load and the add
Code 1
lw  $t0, 12($s0)
add $a0, $a0, $t0
Code 2
lw  $t0, 0($s0)
add $t2, $t3, $t4
add $t2, $t2, $t5
add $t2, $t2, $t6
sub $t2, $t2, $t0
• However, (Code 2) might seem to be fine since the dependence on $t0 is
separated by multiple instructions
• But, if the load of $t0 misses in the cache and takes 100s of cycles, the
sub instruction would have to be stalled
• The hardware is better positioned to observe the cache miss and
reschedule the sub accordingly
• There are other situations that the hardware is better positioned to
observe than the compiler
• Some are exception events, stalls caused by memory reads or writes, and
branch instruction stalls
Conclusion
• Even if the compiler is primarily in charge, the compiler and the
hardware collaborate
• (1) The compiler deals with the hazards that can be seen at compile
time, then the hardware deals with the other hazards that are observed
at run time
• (2) The compiler packages the instructions, then the hardware might
have to re‐package them
• Another approach is to have the hardware primarily in charge
• The hardware groups the instructions
• One limitation of the hardware is that it cannot look at all the code and
analyze it
• It usually looks at a window of ~100 instructions and analyzes them
• Therefore, the compiler arranges the instructions in a ‘beneficial
order’ by separating the dependences
• This helps the hardware generate an efficient grouping of the
instructions
Conclusion
• Whether the compiler or the hardware is in charge, they always need to
collaborate in the multiple‐issue CPU
• The compiler‐based approach is called the static multiple‐issue CPU
• The hardware‐based approach is called the dynamic multiple‐issue CPU
Speculation
• Speculation is the mechanism of executing some instructions so as not
to stall the CPU even if we’re not sure that these instructions are
supposed to be executing
• The ‘predict branch untaken’ strategy that we’ve seen is a speculation
• We speculated that the branch is untaken and allowed the ‘add’ in the
pipeline
beq $t0, $t1, L1
add $s1, $s6, $s7
...
L1:
sub $s4, $s5, $s6
• In the multiple‐issue pipeline, we allow multiple instructions in the
pipeline speculatively
Speculation
• In the code below, the load of $t0 could miss in the cache and it would be
100s of cycles before we get the data
• While waiting for $t0, we can speculate that the branch will be taken and
start processing the code at L1
• Once we get $t0, if the speculation was wrong, the work done by the code
executed at L1 should be canceled
lw $t0, 0($s0)
beq $t0, $t1, L1
add
sub
and
or
...
L1:
...
• The speculation can be either way; alternatively, we can speculate the
branch will not be taken
Speculation
• One popular speculation is switching the order of a load and a store
• If the load misses in the cache, can we execute the store before it?
• If the addresses are equal ($s0 = $s1), we cannot; if they’re different,
then we can
lw $t0, 0($s0)
... # What if address $s0 = $s1
sw $t1, 0($s1)
• We might speculate that the addresses are different and switch the
order
• Later, if we discover that the addresses are the same, we should take
some action to correct the result
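A minimal sketch of this load/store speculation, using a Python dictionary as memory. The function names and the recovery step (keeping the overwritten value so the load can be corrected) are assumptions made for illustration; real hardware or compiler fix-up code recovers differently.

```python
def run_in_order(mem, s0, s1, t1):
    """Reference behavior: the load executes before the store."""
    mem = dict(mem)
    t0 = mem[s0]        # lw $t0, 0($s0)
    mem[s1] = t1        # sw $t1, 0($s1)
    return t0, mem


def run_store_first(mem, s0, s1, t1):
    """Execute the store early, speculating that the addresses differ."""
    mem = dict(mem)
    old = mem[s1]       # remember the overwritten value for recovery
    mem[s1] = t1        # sw executed early
    t0 = mem[s0]        # lw executed late (for example, after a cache miss)
    if s0 == s1:        # speculation was wrong: the load saw the new value
        t0 = old        # corrective action: the load must return the old value
    return t0, mem


memory = {100: 7, 104: 9}
print(run_store_first(memory, 100, 104, 5) == run_in_order(memory, 100, 104, 5))
print(run_store_first(memory, 100, 100, 5) == run_in_order(memory, 100, 100, 5))
```

Both comparisons hold: when the addresses differ the reordering is harmless, and when they are equal the correction restores the in-order result.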
Speculation
• Speculation can be done in the static and dynamic approaches
Speculation in the static approach (by the compiler)
• If the compiler schedules code to execute speculatively, it inserts
additional code that checks if the speculation was correct
• If it wasn’t, the additional code corrects the situation using a fix‐up
routine
Speculation in the dynamic approach (by the hardware)
• When the hardware runs some code speculatively, the results produced
by the code are saved in a buffer (not in the actual registers or memory
locations)
• If the speculation was found to be correct, the results in the buffer are
committed to the actual register/memory locations
• Otherwise, the buffered speculated results are deleted
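The buffering used by hardware speculation can be sketched as follows. The class `SpeculativeBuffer` and its methods are hypothetical names for illustration only; a real CPU tracks this state per instruction in its commit unit.

```python
class SpeculativeBuffer:
    """Holds speculative results until the speculation outcome is known."""

    def __init__(self, registers):
        self.registers = registers   # architectural register state
        self.pending = {}            # speculative results, not yet visible

    def write(self, reg, value):
        self.pending[reg] = value    # buffered; the real register is untouched

    def resolve(self, speculation_correct):
        if speculation_correct:
            self.registers.update(self.pending)   # commit the results
        self.pending.clear()                      # in both cases, empty the buffer


regs = {"$s1": 0, "$s4": 0}
buf = SpeculativeBuffer(regs)

buf.write("$s1", 42)
buf.resolve(speculation_correct=False)
after_wrong = dict(regs)     # the result was discarded

buf.write("$s1", 42)
buf.resolve(speculation_correct=True)
after_right = dict(regs)     # the result was committed
print(after_wrong, after_right)
```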
Speculation
• Speculation can affect the behavior of exceptions
• Code that is executed speculatively might raise exceptions that are
not supposed to be raised
• Example: A ‘load’ executed in speculation uses an illegal address and
raises an exception
• If the speculation turns out to be wrong, the exception should not be
raised, since the load wasn’t supposed to execute
• In compiler‐based speculation, special code is inserted to make the
exceptions ignored until we’re sure the exception is supposed to
happen
• In hardware‐based speculation, the exceptions are buffered and not
serviced until the speculation result is known
Speculation
Example
• In the code below, let’s assume that (y=0) and (i=‐1)
• Since i=‐1, trying to access A[i] would cause an exception
• However, since y=0, the statement below shouldn’t be done and no
exception should normally occur
With speculation
• If ‘y’ misses in the cache, the CPU might speculate that the condition is
true (a wrong speculation, since y=0) and process the array access
• The array access throws an exception since the index is ‐1
• This exception shouldn’t happen under correct execution (since y=0)
C Code
if(y==1) // ‘y’ misses in the cache
A[i]=A[i]+1;
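A hedged sketch of how a buffered exception behaves in this example. The function `speculative_increment` is hypothetical, and the explicit bounds check stands in for the illegal-address detection that real hardware would perform.

```python
def speculative_increment(A, i, y_value):
    """Model `if (y == 1) A[i] = A[i] + 1;` with the body run speculatively."""
    pending_write = None
    pending_exception = None
    # Speculate that y == 1 while the real value of y is still being fetched.
    if 0 <= i < len(A):
        pending_write = (i, A[i] + 1)                         # buffered store
    else:
        pending_exception = IndexError(f"illegal index {i}")  # buffered exception
    # The value of y arrives, so the speculation can now be checked.
    if y_value != 1:
        return "squashed"             # drop the buffered write and exception
    if pending_exception is not None:
        raise pending_exception       # the access was really supposed to happen
    A[pending_write[0]] = pending_write[1]                    # commit the store
    return "committed"


A = [10, 20]
print(speculative_increment(A, -1, 0))  # squashed
```

With y=0 and i=-1 the buffered exception is dropped and the array is left unchanged, which matches the behavior of correct sequential execution.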
Static Multiple Issue
• In the static‐multiple issue CPU, the compiler groups instructions into an
‘issue packet’
• In the figure below, the first issue packet contains 3 instructions
• We can think of an issue packet as a very long instruction
• The issue packet with 3 instructions is like a large instruction with 96 bits
(32‐bit instruction x 3)
• Therefore, the issue packet is called a ‘Very Long Instruction Word’
(VLIW)
• And, static multiple issue CPUs are called VLIW CPUs
[Figure: issue slots over time, marking used and unused slots; the multiple
instructions issued together can be viewed as a ‘Very Long Instruction Word’
(VLIW)]
• Usually, there is a restriction on the type of instructions that issue
simultaneously in a VLIW
Example:
• A 4‐way multiple issue CPU has two ALUs only
• In the VLIW of 4 instructions, two can be R‐type instructions (that use
the ALU) and the other two can be of another type that don’t use the
ALU
• Such a setting limits the resources needed in the datapath
• It also makes practical sense since, in the average case, we don’t
issue four R‐type instructions simultaneously
• The table represents a static two‐issue MIPS CPU
• Such a design is used in MIPS embedded CPUs
• An issue packet contains (an ALU or branch) and (a load or store)
• We can’t have two ‘loads’ or two R‐types in one issue packet
• This condition reduces the number of components in the datapath
• In every clock cycle, the CPU fetches 64 bits of instruction that are
aligned on the 64‐bit boundary
• The datapath figure corresponds to the static two‐issue MIPS CPU
• This is the VLIW content: [ALU or Branch] [Load or Store]
How many registers could we want to read?
• At most four registers (R‐type: 2 and Store: 2)
• Therefore, the register file is modified to allow reading four registers
How many registers could we want to write?
• At most two registers (R‐type: 1 and Load: 1)
• Therefore, the register file is modified to allow writing two registers
How many ALU computations in the EX stage?
• The ALU processes the (R‐type or Branch) and the extra adder processes
the (load or store)
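The register port counts can be derived by checking every legal issue packet. This is a sketch: the tables `READS` and `WRITES` and the function `ports_needed` are illustrative names, and the per-type counts assume the usual MIPS formats (an R‐type reads two registers and writes one, a branch reads two, a load reads a base register and writes one, a store reads a base register and a data register).

```python
# Registers read and written by each instruction type (assumed MIPS formats).
READS = {"rtype": 2, "branch": 2, "load": 1, "store": 2}
WRITES = {"rtype": 1, "branch": 0, "load": 1, "store": 0}


def ports_needed(slot1_types, slot2_types):
    """Worst-case register reads and writes over all legal issue packets."""
    max_reads = max(READS[a] + READS[b] for a in slot1_types for b in slot2_types)
    max_writes = max(WRITES[a] + WRITES[b] for a in slot1_types for b in slot2_types)
    return max_reads, max_writes


reads, writes = ports_needed(["rtype", "branch"], ["load", "store"])
print(reads, writes)  # 4 2
```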
• The two‐issue CPU can improve the performance by a factor up to 2
over the regular five‐stage pipeline since we’re overlapping more
instructions
• However, overlapping more instructions makes the data dependences
more complicated
• When an instruction computes a result, the next few instructions may
not be able to use this result right away due to the dependence
• This is referred to as the use latency
• The use latency can be measured in ‘cycles’ or ‘instructions’
In the regular five‐stage pipeline (single‐issue)…
• The instruction following a ‘load’ cannot use the result of the ‘load’
• This is a use latency of 1 cycle or 1 instruction
• The instruction following an R‐type can use the result of the R‐type since
forwarding is applied
• Therefore, the R‐type has a use latency of zero
lw  $t0, 0($s0)      # use latency of 1 cycle or 1 instruction
                     # next instruction cannot use $t0
...
...
add $s0, $s1, $s2    # use latency of zero
                     # next instruction okay to use $s0
• What happens to the use latency in the two‐issue CPU?
Line 1: [ALU or Branch] [Load or Store]
(the load or store in Line 1 cannot use the result of the R‐type in Line 1)
Line 2: [ALU or Branch] [Load or Store]
(neither instruction in Line 2 can use the result of the load in Line 1)
• R‐type use latency is 1 instruction
• Load use latency is 1 cycle or 2 instructions
• The use latency values have increased!
Observation
• Overlapping more instructions improves the performance
• But now a hazard affects more of the subsequent instructions
• We need to find more mechanisms to improve parallelism
• In the static multiple issue CPUs, we rely on the compiler to do this
• The MIPS code below loads a value from an array into $t0, then adds
$s2 to it and stores the result in the same array location
• The loop stops when the address becomes zero (s1 is initialized as the
address of the last element in the array and goes down to zero)
• How can this code be scheduled on the two‐issue CPU?
Loop: lw   $t0, 0($s1)        # $t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1!=0
• This is one way to schedule the code on the two‐issue CPU
• Remember, the first instruction in the VLIW is ALU or branch and the
second instruction is load or store
Loop: lw   $t0, 0($s1)        # $t0=array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch $s1!=0

      ALU/branch              Load/store          Cycle
Loop: nop                     lw $t0, 0($s1)      1
      addi $s1, $s1, -4       nop                 2
      addu $t0, $t0, $s2      nop                 3
      bne $s1, $zero, Loop    sw $t0, 4($s1)      4
• This schedule achieves an IPC (Instructions Per Cycle) value of 5/4 = 1.25
out of a maximum IPC of 2; there are too many nops!
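The IPC of a two‐issue schedule can be computed directly from its table. In this sketch the list `schedule` transcribes the four cycles above, and `ipc` is an illustrative helper name.

```python
# Each row is one clock cycle: (ALU/branch slot, load/store slot).
schedule = [
    ("nop",                  "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4",    "nop"),
    ("addu $t0, $t0, $s2",   "nop"),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]


def ipc(schedule):
    """Useful (non-nop) instructions divided by the number of cycles."""
    useful = sum(op != "nop" for row in schedule for op in row)
    return useful / len(schedule)


print(ipc(schedule))  # 1.25
```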
• How can we schedule the previous code to get a better performance?
• Observation: Usually in a loop code, there is a lot of dependence;
however, in the previous code there are no dependences between
different iterations of the loop
– Every iteration of the loop reads an array element, adds $s2 to it and stores it back
– Therefore, different iterations use different data
• A technique that can be used here is called ‘loop unrolling’
• Let’s unroll the loop so that each iteration does the work of several
original iterations; the total number of iterations decreases and the loop
code becomes larger, but it’s possible to overlap more instructions
• The transformation below shows the loop unrolling technique
• The original loop has four instructions and iterates 100 times
• The loop code is written twice, back‐to‐back, making the loop twice as
large, but now iterates 50 times
• The two codes are logically equivalent
Original loop (loop 100 times):
Instruction 1
Instruction 2
Instruction 3
Instruction 4
Unrolled loop (loop 50 times):
Instruction 1
Instruction 2
Instruction 3
Instruction 4
Instruction 1’
Instruction 2’
Instruction 3’
Instruction 4’
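The equivalence of the two forms can be illustrated with a small sketch. The function names are hypothetical, and the unrolled version assumes the number of elements is even, just as 100 iterations divide evenly into 50.

```python
def add_scalar(a, s):
    """Original loop: one element per iteration."""
    a = list(a)
    for i in range(len(a)):
        a[i] = a[i] + s
    return a


def add_scalar_unrolled2(a, s):
    """Loop body written twice back-to-back; assumes len(a) is even."""
    a = list(a)
    for i in range(0, len(a), 2):
        a[i] = a[i] + s            # first copy of the body
        a[i + 1] = a[i + 1] + s    # second copy of the body
    return a


data = list(range(100))
print(add_scalar(data, 3) == add_scalar_unrolled2(data, 3))  # True
```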
• This is the loop code after it’s been unrolled
• Now each loop iteration processes four elements in the array
• The address register $s1 is decreased by 16 bytes to jump over 4 words
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop
• Most of the instructions use the register $t0, which means we can’t
reorder the instructions
• However, there is no real dependence between the 4 operations in the
loop’s code
• A technique called ‘register renaming’ allows us to deal with this
• Instead of using $t0 for the four iteration codes, we’re also using the
registers $t1, $t2 and $t3
• This is the ‘register renaming’ procedure
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)
      lw   $t2, -8($s1)
      addu $t2, $t2, $s2
      sw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t3, $t3, $s2
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop
• Register renaming is used when there is a dependence on the name of the
register but there is no real dependence on the data between the
instructions
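A minimal sketch of compiler-style renaming for the unrolled copies. The helper `rename_copies`, the `OFF` placeholder for the address offset, and the rule of numbering registers by copy index are all assumptions made for this illustration.

```python
def rename_copies(body, reg, copies):
    """Give each unrolled copy of `body` its own register in place of `reg`."""
    renamed = []
    for k in range(copies):
        new_reg = reg[:-1] + str(k)   # "$t0" -> "$t0", "$t1", "$t2", ...
        offset = -4 * k               # each copy handles the next word down
        for instr in body:
            renamed.append(instr.replace(reg, new_reg).replace("OFF", str(offset)))
    return renamed


# One original loop body; OFF is a placeholder for the address offset.
body = ["lw $t0, OFF($s1)", "addu $t0, $t0, $s2", "sw $t0, OFF($s1)"]
code = rename_copies(body, "$t0", 4)
for line in code:
    print(line)
```

After renaming, the four copies no longer share a register name, so their loads, adds and stores are free to be reordered and interleaved.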
• The code on the previous slide had ‘lw’ followed by ‘addu’ with a data
dependency; the code also had dependency between the ‘addu’ and the
‘store’
• ‘nops’ can be avoided by reordering the instructions as shown below
Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      lw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t0, $t0, $s2
      addu $t1, $t1, $s2
      addu $t2, $t2, $s2
      addu $t3, $t3, $s2
      sw   $t0, 0($s1)
      sw   $t1, -4($s1)
      sw   $t2, -8($s1)
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop
• This code can benefit more from instruction overlapping and achieves a
better performance
• This is how the unrolled loop code can be scheduled on the two‐issue CPU
• The offsets in the ‘lw’ and ‘sw’ instructions have also changed
• In Cycle 1, the ‘lw’ uses $s1 before it’s decremented by the ‘addi’, so it
uses the original address in $s1; all later instructions see the decremented
$s1, so their offsets are 16 larger (e.g., -4 becomes 12)
      ALU/branch              Load/store          Cycle
Loop: addi $s1, $s1, -16      lw $t0, 0($s1)      1
      nop                     lw $t1, 12($s1)     2
      addu $t0, $t0, $s2      lw $t2, 8($s1)      3
      addu $t1, $t1, $s2      lw $t3, 4($s1)      4
      addu $t2, $t2, $s2      sw $t0, 16($s1)     5
      addu $t3, $t3, $s2      sw $t1, 12($s1)     6
      nop                     sw $t2, 8($s1)      7
      bne $s1, $zero, Loop    sw $t3, 4($s1)      8
• What is the performance now?
• In 8 clock cycles, there are only two nops
• The maximum IPC here is 2
• The IPC achieved by this code is: 14/8 = 1.75
• This is much better than the IPC of 1.25 without unrolling the loop
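• Unrolling the loop raised the IPC by a factor of 1.4
– IPC=1.25 vs. IPC=1.75
• The time per array element improved even more, from 4 cycles to 2 cycles
(8 cycles for 4 elements), because the unrolled loop also executes fewer
instructions per element (14 instead of 20 for four elements)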
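The two schedules can also be compared by cycles per array element, not only by IPC. This sketch uses only figures from the slides (4 cycles for 1 element, 8 cycles for 4 elements, 5 and 14 useful instructions); the helper name `cycles_per_element` is illustrative.

```python
def cycles_per_element(cycles_per_iteration, elements_per_iteration):
    """Clock cycles spent on each array element."""
    return cycles_per_iteration / elements_per_iteration


original = cycles_per_element(4, 1)    # 4-cycle schedule, 1 element per iteration
unrolled = cycles_per_element(8, 4)    # 8-cycle schedule, 4 elements per iteration

ipc_ratio = (14 / 8) / (5 / 4)         # 1.75 / 1.25
time_speedup = original / unrolled     # ratio of cycles per element
print(ipc_ratio, time_speedup)  # 1.4 2.0
```

The IPC ratio understates the gain because unrolling also removes three of every four `addi` and `bne` instructions.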
• However, unrolling the loop made the code larger
• It also used more registers ($t0 to $t3) instead of using $t0 only
• To mitigate this, CPUs usually contain special registers that are used
for register renaming
• One limitation of the VLIW CPUs is that the compiler compiles the code for
a specific hardware
• Therefore, there are many versions of the software, one for each hardware
• Microsoft used to issue a special version of the software for the Intel
Itanium CPU, which is a VLIW CPU
• If software compiled for one VLIW CPU runs on another CPU, it would still
run but it would be slow
• In the worst case, one instruction is issued in each cycle, rather than
multiple
Dynamic Multiple Issue
• Dynamic multiple‐issue CPUs are called superscalar CPUs
• A scalar CPU fetches one instruction at a time (like the datapath we’ve
seen earlier)
• In superscalar CPUs, the hardware determines which instructions
execute in parallel
• The approach is slightly different from the VLIW CPU
• Superscalar CPUs don’t package instructions
• Instead, the execution is based on queues
• Instructions are fetched, decoded and sent to queues
• The instructions in the queues that have their operands ready execute in
parallel
• Instructions waiting on their operands stay longer in the queue
• In superscalar CPUs, the code is guaranteed by the hardware to run
correctly
• The compiler helps the hardware by spreading out dependences
• This helps the hardware become more successful at overlapping
instructions
• The code is compiled in the same way independently of the hardware
structure
• For example, the compiler compiles in the same way for a 5‐stage
pipeline or a 12‐stage pipeline
• It’s up to each hardware to discover which instructions can execute in
parallel
• The instructions issue in‐order so the dependences can be tracked
• They execute in‐order or out‐of‐order (out‐of‐order speeds up the
execution)
• They commit the result in‐order (so the code executes correctly)
Stage     Tasks                                                  Strategy (based on order of instructions in the code)
Issue     Fetch the instructions                                 In-order
Execute   ALU operations / memory access                         In-order / out-of-order
Commit    Write the result to a register / write to the memory   In-order
• The simplest superscalar CPUs execute the code in‐order
• At each cycle, the CPU decides whether to issue zero, one or more
instructions
• The compiler spreads out the dependences to improve the issue rate
Superscalar datapath
Instruction fetch and decode unit
• Fetches the instructions and tracks the dependences
• This is done in‐order so the dependences can be observed
Reservation station & function units
• A queue where the instruction waits until all of its operands are ready
• The operands may be hampered by a cache miss or by another
instruction waiting on the cache miss
• The function units execute the instruction
Commit unit
• Writes the results (commits them) to registers and to memory locations
• The commit unit applies the results in‐order so the code gives the
impression that it has executed sequentially
• This ensures the correctness of the code
• The commit unit contains a part called the reorder buffer, which holds
the results until they can be committed
• The results are committed based on their order in the code
Forwarding results
• When the function unit produces a result, it forwards it to reservation
stations where there might be instructions waiting on this result
• Example: a sequence of instructions with a large chain of dependence
• When new instructions arrive in the reservation stations, they may
need results that are in the commit buffer
• Therefore, the commit buffer forwards the data it’s holding to the
reservation stations
• The code is issued in‐order so as to establish the dependences between
the variables as shown below
Code
addi t1, s0, 40 # Line 1
lw t2, 0(t1) # Line 2
add t3, t3, t2 # Line 3
or t2, t5, t6 # Line 4
and t3, t3, t2 # Line 5
add a0, zero, zero # Line 6
Read After Write (RAW) Dependences
t1 Lines 1 & 2
t2 Lines 2 & 3
t3 Lines 3 & 5
t2 Lines 4 & 5
Code: Read a value from the array, add it to t3. Then, AND the result with
(t5 OR t6).
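The RAW dependences in this code can be found mechanically by remembering the most recent writer of each register. In this sketch the tuple encoding of the instructions and the function `raw_dependences` are assumptions made for illustration. Besides the t1 and t2 pairs, it reports that Line 5 reads the t3 written by Line 3.

```python
# (operation, destination register, source registers), one tuple per line.
code = [
    ("addi", "t1", ["s0"]),             # Line 1
    ("lw",   "t2", ["t1"]),             # Line 2
    ("add",  "t3", ["t3", "t2"]),       # Line 3
    ("or",   "t2", ["t5", "t6"]),       # Line 4
    ("and",  "t3", ["t3", "t2"]),       # Line 5
    ("add",  "a0", ["zero", "zero"]),   # Line 6
]


def raw_dependences(code):
    """Return (register, producer line, consumer line) with 1-based lines."""
    last_writer = {}    # register -> line of its most recent write
    deps = []
    for line, (_, dest, sources) in enumerate(code, start=1):
        for src in dict.fromkeys(sources):      # each source once, in order
            if src in last_writer:
                deps.append((src, last_writer[src], line))
        last_writer[dest] = line
    return deps


print(raw_dependences(code))
# [('t1', 1, 2), ('t2', 2, 3), ('t3', 3, 5), ('t2', 4, 5)]
```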
• The code executes out‐of‐order
• Instructions where the operands are available can execute
• Instructions where the operands are not available yet wait
Code
addi t1, s0, 40
lw t2, 0(t1) # Cache miss
add t3, t3, t2 # On hold
or t2, t5, t6 # Ok to execute
and t3, t3, t2 # On hold
add a0, zero, zero # Ok to execute
• The load is a miss; therefore, the first ‘add’ cannot execute
• Also, the ‘and’ cannot execute either (it uses the result of the first ‘add’)
• However, the ‘or’ and second ‘add’ can execute
• Out‐of‐order execution is called dynamic pipeline scheduling
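A hedged sketch of how the ready instructions could be picked out while the load waits on its miss. The instruction encoding and the function `can_execute` are illustrative. The sketch follows only RAW dependences and assumes register renaming (or the reorder buffer) takes care of the reuse of t2 by the ‘or’.

```python
# (operation, destination register, source registers), one tuple per line.
code = [
    ("addi", "t1", ["s0"]),             # Line 1
    ("lw",   "t2", ["t1"]),             # Line 2, misses in the cache
    ("add",  "t3", ["t3", "t2"]),       # Line 3
    ("or",   "t2", ["t5", "t6"]),       # Line 4
    ("and",  "t3", ["t3", "t2"]),       # Line 5
    ("add",  "a0", ["zero", "zero"]),   # Line 6
]


def can_execute(code, missing_line):
    """Lines (1-based) able to execute while `missing_line` waits on a miss."""
    last_writer = {}
    waiting = {missing_line}    # the load has not produced its result yet
    ready = []
    for line, (_, dest, sources) in enumerate(code, start=1):
        producers = {last_writer[s] for s in sources if s in last_writer}
        if line != missing_line:
            if producers & waiting:
                waiting.add(line)     # needs a result that is not ready
            else:
                ready.append(line)
        last_writer[dest] = line
    return ready


print(can_execute(code, missing_line=2))  # [1, 4, 6]
```

Lines 3 and 5 stay on hold, while the ‘addi’, the ‘or’ and the second ‘add’ are free to execute.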
• The code commits in‐order to ensure correctness
Code
addi t1, s0, 40
lw t2, 0(t1) # Executed later
add t3, t3, t2
or t2, t5, t6 # Executed earlier
and t3, t3, t2
add a0, zero, zero
• Based on the previous slide, the ‘or’ instruction executed while the
‘load’ was waiting on the miss event
• However, the ‘or’ can’t commit its result to register t2 before the ‘load’
• The ‘load’ commits first, then the ‘or’ commits later
• Any instruction further on that uses t2 gets the result of ‘or’
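In-order commit can be sketched with a queue that plays the role of the reorder buffer. The function `commit_in_order` and the labels `add1` and `add2` for the two ‘add’ instructions are illustrative names, and the finish order is an assumed example consistent with the load missing in the cache.

```python
from collections import deque


def commit_in_order(program, finish_order):
    """Commit results in program order, whatever order execution finished in."""
    rob = deque(program)    # reorder buffer, oldest instruction on the left
    done = set()
    committed = []
    for finished in finish_order:
        done.add(finished)
        # Only the oldest instruction may commit, and only once it has finished.
        while rob and rob[0] in done:
            committed.append(rob.popleft())
    return committed


program = ["addi", "lw", "add1", "or", "and", "add2"]
finish_order = ["addi", "or", "add2", "lw", "add1", "and"]   # 'or' finishes before 'lw'
print(commit_in_order(program, finish_order))
# ['addi', 'lw', 'add1', 'or', 'and', 'add2']
```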
• This is another example
Code
lw  t0, 0(s0)    # Cache miss; commits first
add t1, t0, t2   # On hold
sub t3, t4, t5   # Ok to execute
and t0, t5, t6   # Ok to execute; commits later
...
sw  t0, 0(s1)
• ‘Load’ experiences a cache miss; the ‘add’ waits for the miss handling
• The ‘sub’ and ‘and’ can execute meanwhile
• The ‘load’ should commit before the ‘and’
• Accordingly, the ‘store’ uses the result of the ‘and’
• Superscalar CPUs usually support hardware‐based speculation
• Hardware speculation is applied especially to branch instructions
• The CPU can use dynamic branch prediction to speculate on branches
• Results computed under speculation are kept in the commit unit until
it’s sure the speculation was correct; otherwise, these results are
flushed
What are the advantages of superscalar CPUs over VLIW CPUs?
• In other words, if the compiler can package instructions and deal with
dependences, why do we have to build these things in the hardware?
• #1) The compiler can’t see the events that happen at run‐time, leading it
to make sub‐optimal decisions
• #2) Speculation is more accurate when done by the hardware; branch
speculations can use dynamic branch prediction where the hardware
can track the history of the branch
• The equivalent in software is called an execution profile, where we track
the pattern of the branches and feed this information to the compiler so
it produces better code; but we would have to profile each piece of software
• #3) VLIW CPUs require compiling based on the structure of the
hardware; however, superscalar CPUs don’t require compiling for the
specific hardware
• Accordingly, once the software has been compiled and shipped, it can
benefit from all the improvements in superscalar CPUs (year after year)
without the need for re‐compiling
Practical Considerations
• Modern high‐performance CPUs attempt to issue multiple instructions
per clock cycle (eg: 4, 5 or 6 instructions)
• Most applications can sustain a rate of only about 2 instructions per
clock cycle due to the many data dependences
• The compiler‐based and hardware‐based approaches attempt to discover all
the parallelism in the program
• However, it’s not possible to discover all the parallelism
• When the code uses pointers, it’s hard to decide if instructions can
execute in parallel since we may not know whether two pointers are equal
• In other cases, two instructions that can execute in parallel are
separated by 100s of lines of code and go undiscovered
• Multiple‐issue CPUs provide a huge speedup due to overlapping
instruction execution
• What could hold them down?
• It turns out the memory subsystem could impede the execution speed
• If the memory access time is too large and there are a lot of data misses,
even the fastest CPU won’t have the data to continue execution
• Speculation can hide the memory latency, but wrongful speculation
ends up as wasted energy (since instructions that weren’t needed were
executed)
• Superscalar CPUs (especially with hardware speculation) use a lot of
transistors to implement
• This also increases the power used by the chip
• In recent years, we have hit the power wall problem where we can’t
increase the power through the CPU due to unmanageable heat
• Therefore, the trend shifted towards having multiple cores that are
simple (single‐issue, no hardware speculation)
• The table shows how the trend in CPU design has been changing
• In single‐core CPUs, performance was enhanced by increasing the issue
width and supporting speculation
• Later, performance was enhanced by having multi cores; now the issue
width and the speculation are not as prevalent as before
Microprocessor   Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486             1989  25 MHz      5                1            No                        1      5 W
Pentium          1993  66 MHz      5                2            No                        1      10 W
Pentium Pro      1997  200 MHz     10               3            Yes                       1      29 W
P4 Willamette    2001  2000 MHz    22               3            Yes                       1      75 W
P4 Prescott      2004  3600 MHz    31               3            Yes                       1      103 W
Core             2006  2930 MHz    14               4            Yes                       2      75 W
UltraSparc III   2003  1950 MHz    14               4            No                        1      90 W
UltraSparc T1    2005  1200 MHz    6                1            No                        8      70 W
AMD Opteron X4 (Barcelona)
• The AMD Opteron X4 implements the x86 architecture, which originated at Intel
• Opteron X4 is based on a layered architecture
• It breaks down the x86 CISC operations into RISC‐like operations called
ROPs (RISC Operations)
• The ROPs execute in a superscalar datapath with dynamic scheduling
• The figure in the next slide shows the block level layout of the CPU
• This is known as the microarchitecture
Intel CPUs are also based on a layered architecture. They break down x86 instructions
into what are called micro‐operations. The CPU is said to have a RISC core.
[Figure: AMD Opteron X4 microarchitecture]
A load or store is split into two ROPs: one for address computation and one
for memory access.
• The decode unit converts x86 machine code into ROPs
• The remaining parts of the datapath process the ROPs
• Opteron X4 supports register renaming in the hardware (see previous
slides on loop unrolling)
• In x86, the number of general‐purpose registers is 8 in 32‐bit x86 and
16 in 64‐bit x86
• These are called the architectural registers which can be seen by the
assembly programmer
• However, the Opteron X4 has 72 registers that support register
renaming
• The mapping between the renaming registers and the architectural
registers is maintained so the data is copied back correctly
• Opteron X4 allows up to 106 outstanding ROPs that are executing, including:
– 24 integer operations
– 36 floating‐point operations
– 44 loads and stores
Challenges to the performance of the Opteron X4?
• Some x86 instructions might map to a large number of ROPs
• Incorrectly speculated branches
• A long series of data dependence stalls the CPU
• Memory access (as in the other CPUs)
Conclusion
• ILP executes multiple instructions simultaneously and is a significant
improvement to performance
• The static approach is based on the compiler
• The limitation is that the compiler should know the structure of the
hardware
• The dynamic approach puts the emphasis on the hardware
• The advantage is the hardware can see run time events
• The sophisticated schemes implemented in the hardware use a lot of
transistors and increase the power used by the chip
• This has pushed us to explore multiple cores that are simple rather than
a single core that is sophisticated