
UNIT IV – PARALLELISM

Instruction-level parallelism – Parallel processing challenges – Flynn's classification – Hardware multithreading – Multicore processors

INTRODUCTION

To fulfill increasing demands for higher performance, it is necessary to process data concurrently to
achieve better throughput instead of processing each instruction sequentially as in a conventional
computer. Processing data concurrently is known as parallel processing. There are two ways by which we
can achieve parallelism. They are:
 Multiple Functional Units - System may have two or more ALUs so that they can execute two or
more instructions at the same time.
 Multiple Processors - System may have two or more processors operating concurrently.

Earlier processors had only one Arithmetic and Logic Unit (ALU) in the CPU. Furthermore, the ALU could perform only one operation at a time, which made executing a long sequence of arithmetic and logical instructions quite slow. Nowadays, processors are available with multiple functional units. The work of the ALU can be distributed across these functional units and performed in parallel, and this parallel execution increases processing speed.

Parallel computing is a form of computation in which many calculations are carried out
simultaneously, operating on the principle that large problems can often be divided into smaller ones,
which are then solved concurrently in parallel.

There are several different forms of parallel computing: bit-level, instruction-level, data-level and
task-level parallelism. Parallelism has been employed for many years, mainly in high performance
computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.
As power consumption and consequently heat generation by computers has become a concern in recent
years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form
of multi-core processors.

Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements within a
single machine, while clusters and grids use multiple computers to work on the same task. Specialized
parallel computer architectures are sometimes used alongside traditional processors for accelerating
specific tasks.

Parallel computer programs are more difficult to write than sequential ones, because concurrency
introduces several new classes of potential software bugs, of which race conditions are the most common
ones. Communication and synchronization between the different subtasks are typically some of the
greatest obstacles to getting good parallel program performance.
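For instance, a minimal sketch of the classic lost-update race, assuming $s0 holds the address of a counter shared by two threads (register names are illustrative):

     lw   $t0, 0($s0)          # read the shared counter
     addi $t0, $t0, 1          # increment the private copy in a register
     sw   $t0, 0($s0)          # write the new value back
# If both threads execute the lw before either executes the sw, both store the same
# value and one increment is lost; synchronization is needed to make this
# read-modify-write sequence behave atomically.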

The maximum possible speed-up of a single program as a result of parallelization is given by Amdahl's law.

Important Definitions:

Multi Cores, Multiprocessors, and Clusters

 Multiprocessors
A computer system with two or more processors is called a multiprocessor system.
The multiprocessor software must be designed to work with a variable number of processors.
Replacing large, inefficient processors with many smaller, efficient processors can deliver better performance per watt, if the software can use them efficiently.

Features of Multiprocessor System:


o Better Performance
o Scalability
o Improve Availability / Reliability
o High Throughput
o Job-Level Parallelism / Process-Level Parallelism
 Independent jobs running on individual processors
o Parallel Processing Program
 A single program running on multiple processors
 Clusters
A set of computers connected over a local area network that function as a single large
multiprocessor is called a cluster. A cluster is composed of microprocessors housed in many
independent servers or PCs. In addition, clusters can serve equally demanding applications
outside the sciences, such as search engines, web servers, email servers, and databases.

 Multicore Multiprocessors
A multicore is an architecture design that places multiple processors on a single die
(computer chip) to enhance performance and allow simultaneous processing of multiple tasks
more efficiently. Each processor is called a core.

Multiprocessors have been shoved into the spotlight because the power problem means that further increases in performance will apparently come from more processors per chip rather than from higher clock rates and improved CPI. To avoid the redundant name 'multiprocessor microprocessors', such chips are called multicore microprocessors, and the processors they contain are called 'cores'. The number of cores is expected to double every two years. Thus, programmers who care about performance must become parallel programmers.

Challenges
The tall challenge facing the industry is to create hardware and software that will make it
easy to write correct parallel processing programs that will execute efficiently in performance and
power as the number of cores per chip scales geometrically.

INSTRUCTION LEVEL PARALLELISM (ILP)


Introduction/ Concept
ILP is a measure of how many operations in a computer program can be performed
simultaneously. The potential overlap among instructions is called instruction level
parallelism. It is a technique which is used to overlap the execution of instructions to improve
performance. Pipelining is a technique that runs programs faster by overlapping the execution of
instructions. Pipelining is an example of instruction level parallelism.

Instruction level parallelism is a kind of parallelism among instructions. It can exist when
instructions in a sequence are independent and thus can be executed in parallel by overlapping. It
refers to the degree to which, on average the instructions of a program can be executed in parallel.
A combination of compiler based optimization and hardware techniques can be used to maximize
instruction level parallelism.
Two methods of ILP
o Increasing the depth of pipeline
By increasing the depth of the pipeline, more instructions can be in flight and overlapped simultaneously. The amount of parallelism being exploited is higher, since more operations are being overlapped. Performance is potentially greater since the clock cycle can be shorter.

o Multiple Issue
Multiple issue is a technique which replicates the internal components of the
computer so that it can launch multiple instructions in every pipeline stage. Launching
multiple instructions per stage will allow the instruction execution rate to exceed the clock
rate or the CPI to be less than 1.
 Types of Multiple issues
There are two major ways to implement a multiple issue processor such as,
 Static multiple Issues – It is an approach to implement a multiple issue
processor where many decisions are made statically by the compiler before
execution.
 Dynamic Multiple Issues – It is an approach to implement a multiple issue
processor where many decisions are made during execution by the
processor.
The major differences between these two kinds of issues are the division of
work between the compiler and the hardware, because the division of work dictates
whether decisions are made at compile time or during execution time.

 Responsibilities of multiple-issue pipeline


There are two primary and distinct responsibilities that must be dealt with
in a multiple issue pipeline such as,

1. Packaging instructions into issue slots


Issue slots are the positions from which instructions could issue in a given clock
cycle.
In static issue processors, this process is at least partially handled by the compiler; in dynamic issue processors, it is normally dealt with at run time by the processor.

2. Dealing with data and control hazards


In static issue processors, the compiler handles some or all of the data and
control hazards statically. In contrast, most dynamic issue processors attempt to
alleviate at least some classes of hazards using hardware techniques operating at
execution time.

The Concept of Speculation


One of the most important methods for finding and exploiting more ILP is speculation. Speculation is
an approach that allows the compiler or the processor to ‘guess’ about the properties of an instruction, so
as to enable execution to begin for other instructions that may depend on the speculated instruction.
Speculation is an approach whereby the compiler or processor guesses the outcome of an
instruction to remove it as a dependence in executing other instructions.
For example, we might speculate on the outcome of the branch, so that instructions after the branch
could be executed earlier. Another example is that we might speculate that a store that precedes a load
does not refer to the same address, which would allow the load to be executed before the store.
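As a minimal sketch of branch speculation (register names and offsets are illustrative), consider hoisting a load above a branch so it can begin earlier:

# Original order: the load executes only if the branch falls through
     beq  $s0, $zero, Skip     # if $s0 is zero, skip the load
     lw   $t0, 0($s1)
     addu $t2, $t0, $s2
Skip:
# Speculated order: the compiler guesses the branch is not taken and hoists the load
     lw   $t0, 0($s1)          # begins before the branch outcome is known
     beq  $s0, $zero, Skip2
     addu $t2, $t0, $s2
Skip2:

If the guess is wrong, the result of the hoisted load must be discarded, which is why every speculation mechanism needs a way to check the guess and back out its effects.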
Types of Speculation
1. Compiler based Speculation
The compiler can use speculation to reorder instructions, moving an instruction across a branch or
a load across a store. In compiler-based speculation, exception problems are avoided by adding special
speculation support that allows such exceptions to be ignored until it is clear that they really should occur.
Recovery Mechanism
In the case of speculation in software, the compiler usually inserts additional instructions
that check the accuracy of the speculation and provide a fix-up routine to use when the speculation
is incorrect.
2. Hardware-based Speculation
The processor hardware can perform the same transformation i.e., reordering instructions at
runtime. In hardware-based speculation, exceptions are simply buffered until it is clear that the instruction
causing them is no longer speculative and is ready to complete; at that point the exception is raised, and
normal execution handling proceeds.
Recovery Mechanism
In hardware speculation, the processor usually buffers the speculative results until it knows
they are no longer speculative. If the speculation is correct, the instructions are completed by
allowing the contents of the buffers to be written to the registers or memory. If the speculation is
incorrect, the hardware flushes the buffers and re-executes the correct instruction sequence.

Difficulty /Drawback of speculation


Guess may be wrong
The difficulty with speculation is that the guess may be wrong. So, any speculation mechanism
must include both a method to check if the guess was right and a method to unroll or back out the
effects of the instructions that were executed speculatively. The implementation of this back-out
capability adds complexity.
May introduce exception
Speculating on certain instructions may introduce exceptions that were formerly not present. For
example, suppose a load instruction is moved in a speculative manner, but the address it uses is not
legal when the speculation is incorrect. The result is that an exception that should not have occurred
will occur. This problem is complicated by the fact that if the load instruction were not speculative,
then the exception must occur.
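A minimal sketch of this problem, assuming $s3 holds a pointer that may be null (register names are illustrative):

# Original order: the load is guarded by the branch
     beq  $s3, $zero, Skip     # skip the load when the pointer in $s3 is null
     lw   $t0, 0($s3)
Skip:
# If the lw is speculatively hoisted above the beq, it executes even when $s3 is zero,
# and the access to address 0 raises a memory exception that the original, non-speculative
# program would never have seen.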
Conclusion
Since speculation can improve performance when done properly and decrease performance when
done carelessly, significant effort goes into deciding when it is appropriate to speculate.
Static Multiple Issue Processors
Static multiple-issue processors all use the compiler to assist with packaging instructions and handling
hazards.
Issue Packet
In a static issue processor, the set of instructions issued together in a given clock cycle is called an issue packet.
The packet may be determined statically by the compiler or dynamically by the processor.

VLIW
Since a static multiple-issue processor usually restricts what mix of instructions can be initiated in a
given clock cycle, it is useful to think of the issue packet as a single instruction allowing several
operations in certain predefined fields. This view led to the original name for this approach – Very Long
Instruction Word (VLIW).
VLIW is a style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields.
Dealing with hazards
Compiler
Most static issue processors rely on the compiler to take on some responsibility for
handling data and control hazards. The compiler's responsibilities may include static branch prediction and code scheduling to reduce or prevent all hazards.

In some designs, the compiler takes full responsibility for removing all hazards, scheduling
the code and inserting no-ops so that the code executes without any need for hazard detection or
hardware-generated stalls.
Hardware
The hardware detects the data hazards and generates stalls between two issue packets,
while requiring that the compiler avoid all dependencies within an instruction pair. So, a hazard
generally forces the entire issue packet containing the dependent instruction to stall.
Whether the software must handle all hazards or only try to reduce the fraction of hazards
between separate issue packets, the appearance of having a large single instruction with multiple
operations is reinforced.
Static two-issue pipeline in operation

To give a flavor of static multiple issue, we will consider a simple two-issue MIPS processor, where one of the instructions can be an integer ALU operation or a branch and the other can be a load or a store. Such a design requires fetching and decoding 64 bits of instructions, that is, two instructions, per clock cycle.
Static two issue data-path diagram
Additional H/W Requirements for two issue processor
To issue an ALU and a data transfer operation in parallel, the first need for additional hardware –
beyond the usual hazard detection and stall logic – is extra ports in the register file. In one clock cycle
we may need to read two registers for the ALU operation and two more for a store, and also one write
port for an ALU operation and one write port for a load. Since the ALU is tied up for the ALU
operation, we also need a separate adder to calculate the effective address for data transfers. Without
these extra resources, our two-issue pipeline would be hindered by structural hazards.
Advantages / Disadvantages of two issue processor
The two-issue processor can improve performance by up to a factor of 2. However, it requires twice as many instructions to be overlapped during execution, and this additional overlap increases the relative performance loss from data and control hazards.
Use latency-Definition
It is defined as the number of clock cycles between a load instruction and an instruction that can
use the result of the load without stalling the pipeline.
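For example, a sketch assuming a use latency of one clock cycle for loads, as in the classic five-stage MIPS pipeline (register names are illustrative):

     lw   $t0, 0($s1)          # load word into $t0
     addu $t2, $t0, $s2        # uses $t0 in the very next slot, so it stalls one cycle
# Placing an independent instruction between the lw and the addu would hide the
# use latency and avoid the stall.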

Techniques for improving performance


To effectively exploit the parallelism available in a multiple-issue processor, more ambitious
compiler or hardware scheduling techniques are needed, and static multiple issue requires that the
compiler take on this role.
1. Reordering the Instructions

In a static two-issue processor, the compiler attempts to reorder instructions to avoid


stalling the pipeline when branches or data dependencies between successive instructions occur. In
doing so, the compiler must ensure that reordering does not cause a change in the outcome of a
computation. The objective is to place useful instructions in these slots. If no useful instructions
can be placed in the slots, then these slots must be filled with ‘nop’ instructions. The dependency
introduced by the condition-code flags reduces the flexibility available for the compiler to reorder
instructions.
Example

Loop: lw   $t0, 0($s1)        # load an array element into $t0
      addu $t0, $t0, $s2      # add the scalar value in $s2
      sw   $t0, 0($s1)        # store the result back to the array
      addi $s1, $s1, -4       # decrement the array pointer (4-byte words)
      bne  $s1, $zero, Loop   # branch back while elements remain

Reorder the instructions to avoid as many stalls as possible. Assume branches are
predicted, so that control hazards are handled by the hardware. The first three instructions
have data dependencies, and so do the last two.

The scheduled code, as it would look on a two-issue MIPS pipeline, is sketched below; the empty slots are filled with 'nop' instructions.
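One plausible reconstruction of such a schedule (the original figure is not reproduced here; the layout pairs one ALU/branch instruction with one load/store instruction per clock cycle):

        ALU or branch slot           Load/store slot            Clock cycle
Loop:   nop                          lw   $t0, 0($s1)               1
        addi $s1, $s1, -4            nop                            2
        addu $t0, $t0, $s2           nop                            3
        bne  $s1, $zero, Loop        sw   $t0, 4($s1)               4

Note that the sw offset becomes 4($s1) because the addi that decrements $s1 has been moved ahead of it. In this schedule, five useful instructions complete in four clock cycles, an IPC of 1.25, which is better than 1.0 but well short of the ideal 2.0 for a two-issue machine.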

Advantages
Reordering the instructions reduces the number of stalls and thereby increases the performance of the processor.

2. Loop Unrolling
An important compiler technique to get more performance from loops is loop unrolling,
where multiple copies of the loop body are made. After unrolling, there is more ILP available by
overlapping the instructions from different iterations.
Loop unrolling is a technique to get more performance from loops that access arrays, in
which multiple copies of the loop body are made and instructions from different iterations are
scheduled together.
Example

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

Let us see how well loop unrolling and scheduling work in the above example. For simplicity, assume that the loop index is a multiple of 4.
To schedule the loop without any delays, it turns out that we need to make four copies of the loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the loop will contain four copies each of lw, addu and sw, plus a single addi and bne.
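The unrolled and scheduled code is not reproduced in this text; the following is one plausible reconstruction (the renamed registers $t1, $t2 and $t3 and the adjusted offsets are part of the reconstruction; a single addi now subtracts 16 because four elements are processed per pass, and the lw paired with it in cycle 1 still sees the old value of $s1):

        ALU or branch slot           Load/store slot            Clock cycle
Loop:   addi $s1, $s1, -16           lw   $t0, 0($s1)               1
        nop                          lw   $t1, 12($s1)              2
        addu $t0, $t0, $s2           lw   $t2, 8($s1)               3
        addu $t1, $t1, $s2           lw   $t3, 4($s1)               4
        addu $t2, $t2, $s2           sw   $t0, 16($s1)              5
        addu $t3, $t3, $s2           sw   $t1, 12($s1)              6
        nop                          sw   $t2, 8($s1)               7
        bne  $s1, $zero, Loop        sw   $t3, 4($s1)               8

In this sketch, 14 instructions complete in 8 clock cycles, an IPC of 1.75, which is the source of the improvement factor of almost 2 mentioned at the end of this subsection.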

During the unrolling process, the compiler introduced additional registers ($t1, $t2,
$t3). The goal of this process, called register renaming, is to eliminate dependences that are
not true data dependences, but could either lead to potential hazards or prevent the compiler
from flexibly scheduling the code.
Register Renaming
It is the process of renaming the registers by the compiler or hardware to remove
antidependences.
Consider how the unrolled code would look using only $t0. There would be repeated instances of lw $t0, 0($s1) and addu $t0, $t0, $s2 followed by sw $t0, 4($s1); but these sequences, despite using $t0, are actually completely independent – no data value flows from one group of these instructions to the next group. This is what is called an antidependence or name dependence, which is an ordering forced purely by the reuse of a name, rather than a real data dependence, which is also called a true dependence.
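A minimal sketch of this situation (offsets are illustrative):

     lw   $t0, 0($s1)          # first unrolled iteration
     addu $t0, $t0, $s2
     sw   $t0, 0($s1)
     lw   $t0, -4($s1)         # second unrolled iteration reuses the name $t0
     addu $t0, $t0, $s2
     sw   $t0, -4($s1)
# No data value flows from the first group to the second; the ordering is forced only
# by the reuse of $t0. Renaming the second group to use $t1 removes the
# antidependence and lets the two groups be scheduled independently.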
Name Dependence /Antidependence
It is an ordering forced by the reuse of a name, typically a register, rather than by a true
dependence that carries a value between two instructions.
Advantages
Renaming the registers during the unrolling process allows the compiler to subsequently move these independent instructions around so as to better schedule the code. The renaming process eliminates the name dependences while preserving the true dependences.
Loop unrolling and scheduling with dual issue gave us an improvement factor of almost 2,
partly from reducing the loop control instructions and partly from dual issue execution. The cost of
this performance improvement is using four temporary registers rather than one, as well as a
significant increase in code size.

Dynamic Multiple-Issue Processor


Dynamic multiple-issue processors are also known as superscalar processors, or simply superscalars.
In the simplest superscalar processors, instructions issue in order, and the processor decides whether zero,
one, or more instructions can issue in a given clock cycle.
Superscalar Processor
Superscalar is an advanced pipelining technique that enables the processor to execute more than one
instruction per clock cycle by selecting them during execution.
Achieving good performance on such a processor still requires the compiler to try to schedule
instructions to move dependences apart and thereby improve the instruction issue rate. Even with such
compiler scheduling, there is an important difference between this simple superscalar and a VLIW
processor: the code, whether scheduled or not, is guaranteed by the hardware to execute correctly.
Furthermore, compiled code will always run correctly independent of the issue rate or pipeline structure of
the processor. In some VLIW designs, this has not been the case, and recompilation was required when
moving across different processor models; in other static issue processors, code would run correctly across
different implementations, but often so poorly as to make compilation effectively required.
Dynamic Pipeline Scheduling
Many superscalars extend the basic framework of dynamic issue decisions to include dynamic
pipeline scheduling. Dynamic pipeline scheduling chooses which instructions to execute in a given clock
cycle while trying to avoid hazards and stalls.
Three Major Units
Dynamic pipeline scheduling chooses which instructions to execute next, possibly by reordering
them to avoid stalls. In such processors, the pipeline is divided into three major units.
1. Instruction Fetch & issue unit
The first unit fetches instructions, decodes them, and sends each instruction to a
corresponding functional unit for execution.
2. Multiple Functional Units
Each functional unit has buffers, called reservation stations, which hold the operands and
the operation. As soon as the buffer contains all its operands and the functional unit is ready to
execute, the result is calculated. When the result is completed, it is sent to any reservation stations
waiting for this particular result as well as to the commit unit.
Reservation Stations
Reservation station is a buffer within a functional unit that holds the operands and
the operation.
3. Commit unit
It is a unit in a dynamic or out-of-order execution pipeline that decides when it is safe to
release the result of an operation to programmer-visible registers or memory. Commit unit
buffers the result until it is safe to put the result into the register file or into the memory.
Reorder Buffer
The buffer in the commit unit, often called the reorder buffer, is also used to supply operands, in much the same way as forwarding logic does in a statically scheduled pipeline. Once a result is committed to the register file, it can be fetched directly from there, just as in a normal pipeline.
It is a buffer which holds the results in a dynamically scheduled processor until it is
safe to store the results to memory or a register.

Dynamically Scheduled Pipeline – Diagram

Register Renaming
The combination of buffering operands in the reservation stations and results in the reorder
buffer provides a form of register renaming, just like that used by the compiler in loop unrolling.
To see how this conceptually works, consider the following steps:
1. When an instruction issues, it is copied to a reservation station for the appropriate
functional unit. Any operands that are available in the register file or reorder buffer
are also immediately copied into the reservation station. The instruction is buffered
in the reservation station until all the operands and the functional units are
available. For the issuing instruction, the register copy of the operand is no longer
required, and if a write to that register occurred, the value could be overwritten.
2. If an operand is not in the register file or reorder buffer, it must be waiting to be
produced by a functional unit. The name of the functional unit that will produce the
result is tracked. When that unit eventually produces the result, it is copied directly
into the waiting reservation station from the functional unit bypassing the registers.
These steps effectively use the reorder buffer and the reservation stations to implement
register renaming.
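A minimal sketch of a name dependence that this hardware renaming removes (register names are illustrative):

     lw   $t0, 0($s1)          # writes $t0
     addu $t2, $t0, $s2        # reads  $t0
     lw   $t0, 4($s1)          # writes $t0 again: an output dependence with the first
                               # lw and an antidependence with the addu above
     addu $t3, $t0, $s2        # reads the new value of $t0
# Because each result is buffered in a reservation-station or reorder-buffer entry under
# its own tag rather than written straight to $t0, the second lw does not have to wait
# for the addu to read the old value of $t0.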
Out- of- Order Execution
A dynamically scheduled pipeline can be used for analyzing the data flow structure of a program.
The processor then executes the instructions in some order that preserves the data flow order of the
program. This style of execution is called an out-of-order execution, since the instructions can be executed
in a different order than they were fetched.
Out-of-order execution is a situation in pipelined execution when an instruction blocked from
executing does not cause the following instructions to wait.
In-Order Commit
In-order commit is a commit in which the results of pipelined execution are written to the
programmer-visible state in the same order that instructions are fetched.
To make programs behave as if they were running on a simple in-order pipeline, the instruction
fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked,
and the commit unit is required to write results to registers and memory in program fetch order. This
conservative mode is called in-order commit.
If any exception occurs, the computer can point to the last instruction executed, and the only
registers updated will be those written by instructions before the instruction causing the exception.
Although the front end (fetch and issue) and the back end (commit) of the pipeline run in order, the functional units are free to initiate execution whenever the data they need is available. Today, all dynamically scheduled pipelines use in-order commit.
Advantages of Dynamic Scheduling
1. Dynamic scheduling is often extended by including hardware-based speculation, especially for
branch outcomes. By predicting the direction of a branch, a dynamically scheduled processor can
continue to fetch and execute instructions along the predicted path. Because the instructions are
committed in order, we know whether or not the branch was correctly predicted before any
instructions from the predicted path are committed. A speculative, dynamically scheduled pipeline
can also support speculation on load addresses, allowing load-store reordering, and using the
commit unit to avoid incorrect speculation.
2. Not all stalls are predictable; in particular, cache misses can cause unpredictable stalls. Dynamic
scheduling allows the processor to hide some of those stalls by continuing to execute instructions
while waiting for the stall to end.
3. If the processor speculates on branch outcomes using dynamic branch prediction, the exact order of instructions cannot be known at compile time, since it depends on the predicted and actual behavior of branches.
4. As the pipeline latency and issue width change from one implementation to another, the best way
to compile a code sequence also changes.
5. Old code will get much of the benefit of a new implementation without the need for recompilation.
General Conclusion:
Both pipelining and multiple-issue execution increase peak instruction throughput and attempt to exploit instruction-level parallelism (ILP). Data and control dependences in programs offer an upper limit on sustained performance because the processor must sometimes wait for a dependence to be resolved.
Software-centric approaches to exploiting ILP rely on the ability of the compiler to find and reduce
the effects of such dependences, while hardware-centric approaches rely on extensions to the pipeline and
issue mechanisms. Speculation performed by the compiler or the hardware, can increase the amount of
ILP that can be exploited, although care must be taken since speculating incorrectly is likely to reduce
performance.
PARALLEL PROCESSING CHALLENGES
Parallel processing increases the performance of the processor and reduces the time needed to execute a task. The difficulty with parallelism is not the hardware; it is that too few important application programs have been rewritten to complete tasks sooner on multiprocessors.
It is difficult to write software that uses multiple processors to complete one task faster, and the
problem gets worse as the number of processors increases.
Difficulty in Developing Parallel Processing programs
Developing parallel processing programs is harder than developing sequential programs, for the following reasons:

1. Must get better Performance & Efficiency


The first reason is that you must get better performance and efficiency from a
parallel processing program on a multiprocessor; otherwise, you would just use a sequential
program on a Uniprocessor, as programming is easier.

In fact, Uniprocessor design techniques such as superscalar and out-of-order


execution take advantage of instruction-level parallelism, normally without the
involvement of the programmer. Such innovations reduced the demand for rewriting
programs for multiprocessors, since programmers could do nothing and yet their sequential
programs would run faster on new computers.
It is very difficult to write parallel processing programs that are fast, especially as the number of processors increases. The following issues make it hard for parallel processing programs to run faster than their sequential counterparts:
2. Scheduling
Scheduling is a method by which threads, processes or data flows are given access
to system resources. Scheduling is done to load balance and share system resources
effectively or to achieve quality of service.
Scheduling can be done in various fields among that process scheduling is more
important, because in parallel processing we need to schedule the process correctly.
Process scheduling can be done in the following ways:
1. Long term scheduling
2. Medium term scheduling
3. Short term scheduling
4. Dispatcher
3. Load Balancing
The task must be broken into pieces of equal size; otherwise, some processors may be idle while waiting for the ones with larger pieces to finish. To perform parallel processing, the work must be shared equally among all the processors; only then can the idle time of any processor be avoided.
Load balancing is the process of dividing the amount of work that a computer has to do between two or more processors, so that more work gets done in the same amount of time and, in general, all processes get served faster. The workload has to be distributed evenly between the processors to obtain a truly parallel processing task.
4. Time for Synchronization
Synchronization is one of the most important challenges in parallel processing. Because the workload is shared among the processors, each processor must complete its portion of the task within the expected time. A parallel processing program must therefore allow time for synchronization; if any processor does not complete its task within that time, the others are forced to wait and the benefit of parallel processing is lost.
5. Communication Overhead
Parallel processing is achieved only if there is efficient communication between the multiple processors available in the system. The result of a computation done on one processor may be required by another processor, so the first processor has to communicate the result by passing it to the processor that requires it in order to proceed with its execution. If there is no proper and quick communication between the different processors, parallel processing performance will start to degrade.
6. Amdahl’s law
Amdahl’s law is used to calculate the performance gain that can be obtained by
improving some portion of a computer. It states that the performance improvement to be
gained from using some faster mode of execution is limited by the fraction of the time the
faster mode can be used.
Amdahl’s law reminds us that even small parts of a program must be parallelized if the
program is to make good use of many cores.
Speed-up (Performance Improvement)
It tells us how much faster a task can be executed using the machine with the enhancement as compared to the original machine. It is defined as

Speedup = Execution time without the enhancement / Execution time with the enhancement

or

Speedup = Performance with the enhancement / Performance without the enhancement

Fractionenhanced (Fe)
It is the fraction of the computation time in the original machine that can be
converted to take advantage of the enhancement. For example, if CPU’s I/O section is
enhanced and it is assumed that CPU is busy 60% of the time in I/O operations, then
fractionenhanced = 0.6. Fraction enhanced is always less than or equal to 1.

Speedupenhanced (Se)
It tells how much faster the task would run if the enhancement mode was used for
the entire program. For example, if CPU’s I/O section is made 10 times faster then
speedupenhanced is 10. Speed up enhancement is always greater than 1.

Amdahl’s law gives us a quick way to find the speed up from two factors:
Fractionenhanced (Fe) and Speedupenhanced (Se). It is given as

Execution time (new) = Execution time (old) x [ (1 - Fe) + Fe / Se ]

Therefore,

Speedup = Execution time (old) / Execution time (new) = 1 / [ (1 - Fe) + Fe / Se ]
Problems related to Amdahl’s Law:
1. Suppose you want to achieve a speed-up of 80 times faster with 100 processors. What percentage of the original computation can be sequential?
Solution:
Given: Speedup = 80, Speedup enhanced (Se) = 100 (the number of processors), Fe = ?
Amdahl's law says that,

Speedup = 1 / [ (1 - Fe) + Fe / Se ]

We can reformulate Amdahl's law in terms of speed-up versus the original execution time. This formula is usually rewritten assuming that the execution time before is 1 for some unit of time, and the execution time affected by the improvement is considered the fraction of the original execution time. So we have,

80 = 1 / [ (1 - Fe) + Fe / 100 ]

Solving for Fe:

80 x [ (1 - Fe) + Fe / 100 ] = 1
0.8 x [ 100 - 99 Fe ] = 1
100 - 99 Fe = 1.25
Fe = 98.75 / 99 = 0.9975

Thus, to achieve a speedup of 80 from 100 processors, the sequential percentage (1 - Fe) can only be about 0.25% of the original computation.

2. Speed-up Challenge: Bigger Problem (Increase in Problem Size)


Suppose you want to perform two sums: one is a sum of 20 scalar variables, and one is a matrix
sum of a pair of two-dimensional arrays, with dimensions 20 by 20. What speed-up do you get
with 10 versus 50 processors? Next, calculate the speed-ups assuming the matrices grow to 40 by
40?
Solution:
Let us assume that a single addition can be performed in time t. There are 20 additions that do not benefit from parallel processors and 400 (20 x 20) additions that do. The time required for a single processor to perform all additions will be 420t. The execution time for 10 processors is

Execution time (10 processors) = 400t / 10 + 20t = 40t + 20t = 60t

So, the speedup with 10 processors is 420t / 60t = 7.

The execution time for 50 processors is

Execution time (50 processors) = 400t / 50 + 20t = 8t + 20t = 28t

So, the speedup with 50 processors is 420t / 28t = 15.


Thus, for this problem size, we get about 70% (7/10 x 100) of the potential speedup with 10
processors. But, we get only 30% (15/50 x 100) speedup with 50 processors.
Look what happens when we increase the matrix.
When we increase the matrix to 40 x 40, the sequential program now takes 1600t + 20t = 1620t. The execution time for 10 processors is

Execution time (10 processors) = 1600t / 10 + 20t = 160t + 20t = 180t

So, the speedup with 10 processors is 1620t / 180t = 9.

The execution time for 50 processors is

Execution time (50 processors) = 1600t / 50 + 20t = 32t + 20t = 52t

So, the speedup with 50 processors is 1620t / 52t = 31.15.


Thus, for this problem size, we get about 90% (9/10 x 100) of the potential speedup with 10
processors and 62.3% (31.15 / 50 x 100) speedup with 50 processors.
Conclusion:
This example shows that getting good speed-up on a multiprocessor while keeping the problem size fixed is harder than getting good speed-up by increasing the size of the problem.
This allows us to introduce two terms that describe ways to scale up.
1. Strong Scaling – Speedup achieved on a multiprocessor without increasing the size of the
problem.
2. Weak Scaling – Speedup achieved on a multiprocessor while increasing the size of the
problem proportionally to the increase in the number of processors.
3. Speedup Challenge: Balancing Load
To achieve the speedup of 31.15 on the previous bigger problem size with 50 processors, we
assumed that the load was perfectly balanced. That is, each of the 50 processors performs 2% of
the work. In this problem, we have to calculate the impact on speedup if one processor’s load is
higher than all the rest. Calculate the impact on speedup if the hardest working processor’s load is
4% and 10%. Also calculate the utilization of the rest of the processors?
Solution:
a) If one processor has 4% of the parallel load, then it must do 4% x 1600 or 64 additions, and the other 49 processors will share the remaining 1536 additions. Since they are operating simultaneously, we can calculate the execution time as

Execution time = max( 1536t / 49 , 64t ) + 20t = 64t + 20t = 84t

So the speedup drops from 31.15 to 1620t / 84t = 19.29.

Thus, the remaining 49 processors are utilized less than half the time (about 31t each), as compared to the 64t spent by the hardest working processor.
b) If one processor has 10% of the load, it must perform 10% x 1600 or 160 additions, and the other 49 processors share the remaining 1440 additions. Thus,

Execution time = max( 1440t / 49 , 160t ) + 20t = 160t + 20t = 180t

In this case, the speedup drops to 1620t / 180t = 9.

Thus, the remaining 49 processors are utilized less than 20% of the time (about 29t each), as compared to the 160t spent by the hardest working processor.
This example demonstrates the value of balancing the load: a single processor with just twice the load of the others cuts the speed-up by about a third, and five times the load on one processor reduces the speed-up by more than a factor of three.
FLYNN’S CLASSIFICATION
Parallel processing can be classified in many ways. It can be classified according to the internal
organization of processors, according to the interconnection structure used between processors or
according to the flow of information through the system.
One such classification was introduced by Michael J. Flynn. We know that a typical processing unit
operates by fetching instructions and operands from the main memory, executing the instructions, and
placing the results in the main memory. The steps associated with the processing of an instruction form an
instruction cycle. The instruction can be viewed as forming an instruction stream flowing from main
memory to the processor, while the operands form another stream, data stream, flowing to and from the
processor.
Instruction stream and data stream flowing between the processor (P) and main memory (M) – diagram

In 1966, Michael J. Flynn made an informal and widely used classification of processor parallelism based on the number of simultaneous instruction and data streams seen by the processor during program execution.
The classification made by Michael J. Flynn divides computers into four major groups:
 Single Instruction Stream – Single Data Stream (SISD)
 Single Instruction Stream – Multiple Data Stream (SIMD)
 Multiple Instruction Stream – Single Data Stream (MISD)
 Multiple Instruction Stream – Multiple Data Stream (MIMD)
Categorization based on No. of instruction streams & No. of Data streams
This classification is based on the number of instruction streams and the number of data streams. Thus, a conventional uniprocessor has a single instruction stream and a single data stream, and a conventional multiprocessor has multiple instruction streams and multiple data streams.

Single Instruction Stream Single Data Stream (SISD)


A single processor executes a single instruction stream to operate on data stored in a single memory. Uniprocessors fall into this category. Most conventional machines with one CPU containing a single arithmetic logic unit (ALU) capable of doing only scalar arithmetic fall into this category; SISD computers and sequential computers are thus synonymous. In SISD computers, instructions are executed sequentially but may overlap in their execution stages. They may have more than one functional unit, but all functional units are controlled by a single control unit.

Single Instruction Stream Multiple Data Stream (SIMD)


A single machine instruction controls the simultaneous execution of a number of processing
elements on a lockstep basis. This category corresponds to array processors. They have multiple
processing / execution units and one control unit. Therefore, all processing / execution units are supervised
by the single control unit. Here, all processing elements receive the same instruction from the control unit but operate on different data sets drawn from distinct data elements.
SIMD computers exploit data level parallelism by applying the same operations to multiple items
of the data in parallel. Each processor has its own data memory but there is a single instruction memory
and control processor, which fetches and dispatches instructions. For applications that display significant
data-level parallelism, the SIMD approach can be very efficient. Vector architectures are the largest class of SIMD architectures.
For applications with lots of data parallelism, the most cost effective platforms are SIMD
machines. In these machines, a single control unit broadcasts (micro-) instructions to many processing
elements (PE's, each of which is a set of functional units with local storage) in parallel.

If you imagine a pipeline in which fetching operands is separate from and follows instruction
decoding, then a PE is the part of a CPU that implements all the stages after instruction decoding, while a
control unit is the part of a CPU that implements all the stages up to instruction decoding. An SIMD
computer connects each control unit not to one PE, but to many PEs.

An application is data parallel if it wants to do the same computation on lots of pieces of data,
which typically come from different squares in a grid. Examples include image processing, weather
forecasting, and computational fluid dynamics (e.g. simulating airflow around a car or inside a jet engine).

SIMD machines cannot use commodity microprocessors, one reason being that it would be very
difficult to modify these to broadcast their control signals to a multitude of processing elements. The
companies that design SIMD machines have all designed their own processing elements and control units.
The processing elements are usually slower than ordinary microprocessors, but they are also much
smaller, which makes it possible to put several on a single chip.

Since the CPUs are nonstandard, SIMD machines need their own compilers and other system
software. The costs of designing the CPU and this system software add significantly to the up-front
investment required for the machine. Due to the multi-million dollar price tags of SIMD machines, this
investment has to be recovered from a relatively small number of customers, so each customer's share of
the development cost is quite high.

SIMD machines were reasonably popular in the late 1980s; at least as popular as machines with
multi-million dollar price tags could be. However, the difficulty of programming them and their
specialized nature (their price/performance is abysmal for any job that is not data parallel) led to the
demise of the companies that designed and sold them. However, the idea survives in a dramatically
scaled-down form, in the multimedia instructions added to most instruction sets during the middle to late
1990s.

Providing more than one arithmetic logic unit (ALU) that can all operate in parallel on different
inputs, providing the same operation, is an example of SIMD. This can be achieved by using multiple
input buses in the CPU for each ALU that load data from multiple registers. The processor's control unit
sends the same command to each of the ALUs to process the data and the results may be stored, again
using multiple output buses. Machines that provide vector operations are classified as SIMD. In this case a
single instruction is simultaneously applied to a vector.

For vector machines, the size of the vector is proportional to the parallelism. This is an example of
spatial parallelism. Pipelining exploits temporal parallelism within a single instruction stream. More
pipeline stages generally lead to more parallelism, to a limit.

Advantages of SIMD
 Reduces the cost of control unit over dozens of execution units.
 It has reduced instruction bandwidth and program memory.
 It needs only one copy of the code that is being executed simultaneously.
 SIMD works best when dealing with arrays in ‘for’ loops. Hence, for parallelism to work in SIMD,
there must be a great deal of identically structured data, which is called data-level parallelism.
Disadvantages of SIMD
 SIMD is at its weakest in case or switch statements, where each execution unit must perform a
different operation on its data, depending on what data it has.
 Execution units with the wrong data are disabled, so that units with proper data may continue. Such situations essentially run at 1/nth performance, where 'n' is the number of cases.
Variations of SIMD
SIMD in x86 – Multimedia Extensions
The most widely used variation of SIMD is found in almost every microprocessor today, and is the
basis of the hundreds of MMX and SSE instructions of the x86 microprocessor. They were added to
improve performance of multimedia programs. These instructions allow the hardware to have many ALUs
operate simultaneously or, equivalently, to partition a single, wide ALU into many parallel smaller ALUs
that operate simultaneously.
This very low cost parallelism for narrow integer data was the original inspiration of the MMX
instructions of the x86. With multimedia extensions, more hardware and instructions were added, producing a new extension called Streaming SIMD Extensions (SSE); its most recent successor is called Advanced Vector Extensions (AVX).
AVX supports simultaneous operations on four 64-bit floating-point numbers. The width of
the operation and the registers is encoded in the opcode of these multimedia instructions. As the data
width of the registers and operations grew, the number of opcodes for multimedia instructions exploded,
and now there are hundreds of SSE instructions to perform the useful combinations.
Vector Processors
An older and more elegant interpretation of SIMD is called a vector architecture, which has been
closely identified with Cray computers. It is again a great match to problems with lots of data-level
parallelism. Rather than having 64 ALUs perform 64 additions simultaneously, like the old array
processors, the vector architectures pipelined the ALU to get good performance at lower cost.
The basic philosophy of vector architecture is to collect data elements from memory, put them in
order into a large set of registers, operate on them sequentially in registers, and then write the results back
to memory. A key feature of vector architectures is a set of vector registers. Thus, a vector architecture might have 32 vector registers, each with 64 64-bit elements.
Vector elements are independent and can be operated on in parallel. All modern vector computers have vector functional units with multiple parallel pipelines, called vector lanes. A vector functional unit with parallel pipelines produces two or more results per clock cycle.
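As an illustration of this style, here is a sketch written in a hypothetical vector extension of MIPS (the mnemonics lv/sv for vector load/store, addv.d for vector add, and mulvs.d for vector-scalar multiply are illustrative and are not part of the base MIPS instruction set), computing Y = a*X + Y for one vector register's worth of double-precision elements:

     l.d      $f0, 0($s2)      # load the scalar a (its address is assumed to be in $s2)
     lv       $v1, 0($s0)      # load vector X (base address assumed in $s0)
     mulvs.d  $v2, $v1, $f0    # vector-scalar multiply: a * X
     lv       $v3, 0($s1)      # load vector Y (base address assumed in $s1)
     addv.d   $v4, $v2, $v3    # vector add: a * X + Y
     sv       $v4, 0($s1)      # store the result back over Y
# Six instructions do the work of a scalar loop that executes several hundred
# instructions, which is the source of the instruction bandwidth and power savings
# listed in the advantages below.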
Advantages of vector processors
 Vector processor greatly reduces the dynamic instruction bandwidth, executing only six
instructions versus almost 600 for MIPS.
 The reduction in instructions fetched and executed, saves power.
 Frequency of occurrence of pipeline hazards is reduced.
 On the vector processor, each vector instruction will only stall for the first element in each vector,
and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are
required only once per vector operation, rather than once per vector element.
 The pipeline stalls can be reduced on MIPS by using loop-unrolling.
Vector vs Scalar
Vector instructions have several important properties compared to conventional instruction set
architectures, which are called scalar architectures in this context:
 A single vector instruction specifies a great deal of work – it is equivalent to executing an entire loop. The instruction fetch and decode bandwidth needed is dramatically reduced.
 By using a vector instruction, the compiler or programmer indicates that the computation of each
result in the vector is independent of the computation of other results in the same vector, so
hardware does not have to check for data hazards within a vector instruction.
 Vector architectures and compilers have a reputation of making it much easier than MIMD
multiprocessors to write efficient applications when they contain data-level parallelism.
 Hardware need only check for data hazards between two vector instructions once per vector
operand, not once for every element within the vectors. Reduced checking can save power as well.
 Vector instructions that access memory have a known access pattern. If the vector’s elements are
all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very
well. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather
than once for each word of the vector.
 Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control
hazards that would normally arise from the loop branch are nonexistent.
 The savings in instruction bandwidth and hazard checking plus the efficient use of memory
bandwidth give vector architectures advantages in power and energy versus scalar architectures.
For these reasons, vector operations can be made faster than a sequence of scalar operations on the
same number of data items, and designers are motivated to include vector units if the application domain
can use them frequently.
Vector vs Multimedia Extensions
 Like multimedia extensions found in the x86 SSE instructions, a vector instruction specifies
multiple operations. However, multimedia extensions typically specify a few operations while
vector specifies dozens of operations.
 Unlike multimedia extensions, the number of elements in a vector operation is not in the opcode
but in a separate register. This means different versions of the vector architecture can be
implemented with a different number of elements just by changing the contents of that register and
hence retain binary compatibility. In contrast, a new large set of opcodes is added each time the
‘vector’ length changes in the multimedia extension architecture of the x86.
 Unlike multimedia extensions, the data transfers need not be contiguous. Vectors support both strided accesses, where the hardware loads every nth data element in memory, and indexed accesses, where the hardware finds the addresses of the items to be loaded in a vector register.
 Like multimedia extensions, vector easily captures the flexibility in data widths, so it is easy to
make an operation work on 32 64-bit data elements or 64 32-bit data elements or 128 16-bit data
elements or 256 8-bit data elements.
Sl. No.  Vector Architecture                                    Multimedia Extensions
1        Specifies dozens of operations                         Specifies a few operations
2        Number of elements in a vector operation is not in     Number of elements in a multimedia extension
         the opcode (it is held in a separate register)         operation is encoded in the opcode
3        Data transfers need not be contiguous                  Data transfers need to be contiguous
4        Specifies multiple operations                          Also specifies multiple operations
5        Easily captures the flexibility in data widths         Also captures the flexibility in data widths
6        Easier to evolve over time                             More complex to evolve over time
Generally, vector architectures are a very efficient way to execute data parallel processing programs;
they are better matches to compiler technology than multimedia extensions; and they are easier to evolve
over time than the multimedia extensions to the x86 architecture.

Multiple Instruction Stream Single Data Stream (MISD)


A sequence of data is transmitted to a set of processors, each of which executes a different
instruction sequence. This structure is not commercially implemented. Not many parallel processors fit
well into this category. In MISD, there are ‘n’ processor units, each receiving distinct instructions
operating over the same data stream and its derivatives. The results of one processor become the input of
the next processor in the micropipe. The fault-tolerant computers where several processing units process
the same data using different programs belong to the MISD class. The results of such apparently
redundant computations can be compared and used to detect and eliminate faulty results.
MISD is short for multiple instruction, single data, a type of parallel computing architecture classified under Flynn's taxonomy. Each processor owns its own control unit and local memory, making the processing elements more powerful than those used in SIMD computers. Each processor operates under the control of an instruction stream issued by its own control unit, so the processors potentially execute different programs over the same data stream while solving different sub-problems of a single problem. This means that the processors usually operate asynchronously.
Multiple Instruction Stream Multiple Data Stream (MIMD)
A set of processors simultaneously executes different instruction sequences on different data sets. SMPs, clusters and NUMA systems fit into this category, and most multiprocessor and multicomputer systems can be classified here. In MIMD, there is more than one processor unit, giving the ability to execute several programs simultaneously. An MIMD computer implies interactions among the multiple processors, because all memory streams are derived from the same data space shared by all processors. If the 'n' data streams are derived from disjoint subspaces of the shared memories, then we would have the so-called Multiple SISD (MSISD) operation.
In MIMD, each processor fetches its own instructions and operates on its own data. MIMD computers exploit thread-level parallelism, since multiple threads operate in parallel. MIMDs offer flexibility: with the correct hardware and software support, MIMDs can function as single-user processors focusing on high performance for one application, or as multiprogrammed multiprocessors running many tasks simultaneously.
Most multiprocessors today on the market are (shared memory) MIMD machines. They are built
out of standard processors and standard memory chips, interconnected by a fast bus (memory is
interleaved). If the processor's control unit can send different instructions to each ALU in parallel then the
architecture is MIMD. A superscalar architecture is also MIMD. In this case there are multiple execution
units so that multiple instructions can be issued in parallel.

The use of standard components is important because it keeps down the costs of the company
designing the multiprocessor; the development cost of the standard components is spread out over a much
larger number of customers.

In theory, the interconnection network can be something other than a bus. However, for cache
coherence, you need an interconnection network in which each processor sees the traffic between every
other processor and memory, and all such interconnection networks are either buses or have components
which are equivalent to buses. Low-end and midrange multiprocessors use buses; some high-end
multiprocessors use multiple bus systems, or crossbars with broadcast as well as point-to-point capability.

With the MIMD organization, the processors are general purpose and each is able to process all of
the instructions necessary to perform the appropriate data transformation.
It can be divided into two types:
1. Shared Memory
2. Distributed Memory
Shared Memory Architecture
If the processors share a common memory then each processor accesses programs and data stored
in the shared memory and processors communicate with each other via that memory.

Examples of such systems include:


SMP – Symmetric Multiprocessors
In an SMP, multiple processors share a single memory or pool of memory by means of a
shared bus or other interconnection mechanism.
NUMA – Non-Uniform Memory Access
In a NUMA system, all processors have access to all parts of main memory using loads and stores, but the memory access time of a processor differs depending on which region of main memory is accessed.
Distributed Memory Architecture
In a distributed memory architecture, the memory is distributed among the systems connected in the network, and each processor has its own local memory.

Example for distributed memory architecture – Clusters


A cluster is a collection of independent uniprocessor or SMP machines, which are interconnected to form a single system. Communication among the computers is either via fixed paths or via some network facility.
Advantages of clustering method:
 Easy to implement
 High availability
 Low network and server overhead
 Reduced cost
Disadvantages
 Increased complexity
 Requires disk mirroring
 Requires lock manager software

HARDWARE MULTITHREADING
Important Terms used in Multithreaded Processors
Multithreading
Multithreading is a higher-level parallelism called thread-level parallelism (TLP) because it is
logically structured as separate threads of execution.
When pipelining is used, it is essential to maximize the utilization of each pipeline stage to
improve throughput. It can be accomplished by executing some instructions in a different order rather than
executing them sequentially as they occur in the instruction stream and initiating execution of some
instructions even though it is not required. However, this approach needs more complex mechanisms in
the design. The designer cannot cross the limitations of circuit complexity and power consumption.
Therefore, another approach is used, called multithreading.
In multithreading, the instruction stream is divided into several smaller streams, called threads,
such that the threads can be executed in parallel. Here, a high degree of instruction-level parallelism can
be achieved without increasing the circuit complexity or power consumption.
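For instance, a minimal sketch of two such threads (register names are illustrative; in hardware multithreading, described below, each thread has its own program counter and register state):

# Thread A:
     lw   $t0, 0($s1)
     addu $t0, $t0, $s2
     sw   $t0, 0($s1)

# Thread B:
     lw   $t0, 0($s4)
     addu $t0, $t0, $s5
     sw   $t0, 0($s4)

# The two instruction streams are independent, so while thread A waits on its load the
# processor can issue instructions from thread B; because each thread has its own
# registers, the two uses of $t0 do not conflict.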
Process
A process is an instance of a program running on a computer. The process image is the collection of
program data, stack and attributes that define the process. The process image is stored in a virtual address
space. There are two important characteristics of a process:
1. Resource ownership – A process may get control of resources such as main memory, I/O
channels, I/O devices and files from time to time.
2. Scheduling / Execution – A process executes through one or more programs, and this
execution may be interleaved with that of other processes. The operating system decides the
execution state of each process (such as running or ready) and its dispatching priority.
Process Switch
A process switch is an operation that transfers control of the processor from one process to another.
It first saves all the process control data, registers, and other information of the first process and then
replaces them with the process information of the second.
Thread
A thread is a separate process with its own instructions and data. A thread may represent a process
that is part of a parallel program consisting of multiple processes, or it may represent an independent
program on its own. A thread includes the program counter, stack pointer and its own area for a stack. It
executes sequentially and can be interrupted to transfer control to another thread.
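As a software-level illustration of this definition, the following minimal C sketch (assuming a POSIX system with the pthreads library; the worker function and the thread count are illustrative only) creates two threads inside one process. Each thread executes with its own stack and program counter while sharing the process's global data.

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function with its own stack and program counter,
   while sharing the enclosing process's global data. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;

    /* Create two threads within the same process. */
    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);

    /* Wait for both threads to finish before the process exits. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}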
Thread switch
A thread switch is an operation that switches the control from one thread to another within the
same process. This is cheaper than a process switch.
Explicit threads
User-level threads, which are visible to the application program, and kernel-level threads, which are
visible only to the operating system, are both referred to as explicit threads.
Implicit and Explicit Multithreading
Implicit Multithreading refers to the concurrent execution of multiple threads extracted from a
single sequential program.
Explicit Multithreading refers to the concurrent execution of instructions from different explicit
threads, either by interleaving instructions from different threads on shared pipelines or by parallel
execution on parallel pipelines.
Thread – Level Parallelism
Unlike instruction-level parallelism, which exploits implicit parallel operations within a loop or
straight-line code segment, thread-level parallelism is explicitly represented by the use of multiple threads
of execution that are inherently parallel.
Thread-level parallelism is an important alternative to instruction-level parallelism primarily
because it could be more cost-effective to exploit than instruction-level parallelism. There are many
important applications where thread-level parallelism occurs naturally, as it does in many server
applications.
Hardware Multithreading
Hardware multithreading allows multiple threads to share the functional units of a single processor
in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of
each thread. For example, each thread would have a separate copy of the register file and the PC. The
memory itself can be shared through the virtual memory mechanisms, which already support
multiprogramming.
In addition, the hardware must support the ability to change to a different thread relatively quickly.
In particular, a thread switch should be much more efficient than a process switch, which typically
requires hundreds to thousands of processor cycles while a thread switch can be instantaneous.
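As a rough illustration of the duplicated per-thread state (not any real processor's layout; the register count, field names, and thread count below are assumptions made for the sketch), a hardware-multithreaded core can be pictured as holding one context per thread, while functional units, caches, and TLBs remain shared:

#include <stdint.h>

#define NUM_REGS   32   /* assumed architectural register count */
#define HW_THREADS  4   /* assumed number of hardware threads   */

/* State that a hardware-multithreaded core duplicates per thread. */
typedef struct {
    uint64_t pc;              /* per-thread program counter         */
    uint64_t regs[NUM_REGS];  /* per-thread architectural registers */
    int      stalled;         /* set while waiting, e.g. on a miss  */
} hw_thread_context;

/* The core holds one context per hardware thread; functional units,
   caches and TLBs are shared rather than duplicated. */
typedef struct {
    hw_thread_context thread[HW_THREADS];
} multithreaded_core;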
Different Approaches of H/W Multithreading
There are two main approaches to hardware multithreading. They are
1. Fine-grained multithreading
2. Coarse-grained multithreading
Fine Grained Multithreading
Fine-grained multithreading is a version of hardware multithreading that suggests switching
between threads after every instruction. It switches between threads on each instruction, resulting in
interleaved execution of multiple threads.
The processor executes two or more threads at a time. It switches from one thread to another at
each clock cycle. During execution, if a thread is blocked because of data dependencies or memory
latencies, then that thread is skipped and a ready thread is executed instead.
This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that
time. To make fine-grained multithreading practical, the processor must be able to switch threads on every
clock cycle.
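As a software analogy only (real fine-grained multithreading is a hardware issue policy, not code), the sketch below captures the round-robin selection rule just described: each cycle, pick the next thread after the one chosen last cycle, skipping any thread that is stalled. The struct and function names are illustrative.

#define HW_THREADS 4    /* assumed number of hardware threads */

struct thread_state {
    int stalled;        /* non-zero while the thread waits, e.g. on a miss */
};

/* Returns the thread to issue from in this cycle, or -1 if all are stalled.
   'last' is the thread that issued in the previous cycle. */
int pick_next_thread(const struct thread_state t[HW_THREADS], int last) {
    for (int i = 1; i <= HW_THREADS; i++) {
        int candidate = (last + i) % HW_THREADS;
        if (!t[candidate].stalled)
            return candidate;   /* round-robin, skipping stalled threads */
    }
    return -1;                  /* every thread is stalled this cycle */
}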
Advantage
One key advantage of fine-grained multithreading is that it can hide the throughput losses
that arise from both short and long stalls, since instructions from other threads can be executed
when one thread stalls.
Disadvantage
The primary disadvantage of fine-grained multithreading is that it slows down the
execution of the individual threads, since a thread that is ready to execute without stalls will be
delayed by instructions from other threads.
Coarse –Grained Multithreading
Coarse-grained multithreading is a version of hardware multithreading that suggests switching
between threads only after significant events, such as a cache miss. It switches threads only on costly
stalls, such as second-level cache misses.
The processor executes instructions of a thread sequentially and if an event that causes any delay
occurs, it switches to another thread.
This change relieves the need to have thread switching be essentially free and is much less likely to
slow down the execution of an individual thread, since instructions from other threads will only be issued
when a thread encounters a costly stall.
Advantage
 Coarse-grained multithreading is much more useful for reducing the penalty of high-cost
stalls, where the pipeline refill time is negligible compared to the stall time.
 It relieves the need to have very fast thread-switching.
 It does not slow down the execution of an individual thread, since instructions from other threads
are issued only when the thread encounters a costly stall.
Disadvantage
 Coarse-grained multithreading is limited in its ability to overcome throughput losses,
especially from shorter stalls, due to pipeline start-up costs.
 Since a processor with coarse-grained multithreading issues instructions from a single
thread, when a stall occurs, the pipeline must be emptied or frozen.
 The new thread that begins executing after the stall must fill the pipeline before instructions
will be able to complete.
Simultaneous Multithreading (SMT)
Simultaneous multithreading is a version of multithreading that lowers the cost of multithreading
by utilizing the resources needed for a multiple-issue, dynamically scheduled microarchitecture. The issue
slots of a wide superscalar processor are filled by executing multiple threads simultaneously on its multiple
execution units.
It is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically
scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level
parallelism. The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than a single thread can effectively use.
Advantages
 Simultaneous Multithreaded Architecture is superior in performance to a multiple-issue
multiprocessor (multiple-issue CMP).
 SMT boosts utilization by dynamically scheduling functional units among multiple threads.
 SMT also increases hardware design flexibility.
 SMT adds relatively little complexity to instruction scheduling, since it relies on the processor's
existing dynamic scheduling mechanisms.
 With register renaming and dynamic scheduling, multiple instructions from independent threads
can be issued without regard to the dependences among them; the resolution of the dependences
can be handled by the dynamic scheduling capability.
 Since SMT relies on these existing dynamic mechanisms, it does not switch resources every
cycle. Instead, SMT is always executing instructions from multiple threads, leaving it up to the
hardware to associate instruction slots and renamed registers with their proper threads.
The following figure illustrates the differences in a processor’s ability to exploit superscalar
resources for the following processor configurations. The top portion shows how four threads would
execute independently on a superscalar with no multithreading support. The bottom portion shows how
the four threads could be combined to execute on the processor more efficiently using three multithreading
options:
 A superscalar with coarse-grained multithreading
 A superscalar with fine-grained multithreading
 A superscalar with simultaneous multithreading

Issue Slot Usage for the Four Processor Configurations – Diagram

In the above diagram, the horizontal dimension represents the instruction issue capability in each
clock cycle. The vertical dimension represents a sequence of clock cycles. Empty slots indicate that the
corresponding issue slots are unused in that clock cycle.
In a superscalar without hardware multithreading support, the use of issue slots is limited by a
lack of instruction-level parallelism. In addition, a major stall, such as an instruction cache miss, can leave
the entire processor idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor. Although this reduces the number of
completely idle cycles, the pipeline start-up overhead still leads to idle cycles, and limitations in ILP
mean that not all issue slots will be used. Furthermore, in a coarse-grained multithreaded processor, since
thread switching occurs only when there is a stall and the new thread has a start-up period, there are likely
to be some fully idle cycles remaining.
In the fine-grained multithreaded superscalar, the interleaving of threads mostly eliminates
fully empty slots. Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock cycles.
In the simultaneous multithreading case, thread-level parallelism and instruction-level
parallelism are both exploited, with multiple threads using the issue slots in a single clock cycle. Ideally,
the issue slot usage is limited by imbalances in the resource needs and resource availability over multiple
threads. In practice, other factors can restrict how many slots are used. Although the above diagram
greatly simplifies the real operation of these processors, it does illustrate the potential performance
advantages of multithreading in general and SMT in particular.
Conclusion

Let us conclude with 3 observations:


1. First, the power wall is forcing designs toward simpler and more power-efficient
processors on a chip. It may well be that the under-utilized resources of out-of-order processors
may be reduced, and so simpler forms of multithreading will be used.
2. Second, a key performance challenge is tolerating latency due to cache misses. Fine-grained
computers switch to another thread on a miss, which is probably more effective in hiding memory
latency than trying to fill unused issue slots as in SMT.
3. A third observation is that the goal of hardware multithreading is to use hardware more efficiently
by sharing components between different tasks. Multicore designs share resources as well. Such
sharing reduces some of the benefits of multithreading compared with providing more non-
multithreaded cores.
Design challenges in SMT
 Impact of fine-grained multithreading on single thread performance.
o A preferred-thread approach attempts to sacrifice neither throughput nor single-thread performance.
o Unfortunately, with a preferred thread, the processor is likely to sacrifice some throughput
when the preferred thread stalls.
 Larger register file needed to hold multiple contexts.
 Not affecting clock cycle time, especially in
o Instruction issue- more candidate instructions need to be considered.
o Instruction completion – choosing which instructions to commit may be challenging.
 Ensuring that cache and TLB conflicts generated by simultaneous execution of multiple threads do
not cause significant performance degradation.
o Two observations made with respect to these problems are:
 Potential performance overhead due to multithreading is small.
 The efficiency of a superscalar is low, leaving considerable room for improvement.
 Works well if:
o The number of compute-intensive threads does not exceed the number of threads supported by
the SMT processor.
o Threads have highly different characteristics.
 Does not work well if:
o Threads try to utilize the same functional units.
o Thread-assignment problems arise, for example on a dual-processor system where each
processor supports two threads simultaneously.
MULTI CORE PROCESSOR
A multicore design takes several processor cores and packages them as a single processor. The
goal is to enable the system to run more tasks simultaneously and thereby achieve greater overall system
performance.
Given the difficulty of rewriting old programs to run well on parallel hardware, a natural question
is what computer designers can do to simplify the task.
One answer was to provide a single physical address space that all processors can share, so that
programs need not concern themselves with where they run, merely that they may be executed in parallel.
In this approach, all variables of a program can be made available at any time to any processor. When the
physical address space is common – which is usually the case for multicore chips – then the hardware
typically provides cache coherence to give a consistent view of the shared memory.
The alternative is to have a separate address space per processor that requires that sharing must be
explicit.
Shared Memory Multiprocessor (SMP)
SMP is a parallel processor with a single address space, implying implicit communication with
loads and stores. It offers the programmer a single address space across all processors, although a more
accurate term would have been shared-address multiprocessor. Such systems can still run independent
jobs in their own virtual address space, even if they all share a physical address space. Processors
communicate through shared variables in memory, with all processors capable of accessing any memory
location via loads and stores.
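As a small sketch of this idea (assuming a POSIX system with pthreads; the variable and function names are illustrative only), one thread writes results into a shared array with ordinary stores and, after the thread is joined, the main thread reads them back with ordinary loads; no explicit send or receive is involved.

#include <pthread.h>
#include <stdio.h>

#define N 8

/* Shared memory: every thread reaches it with ordinary loads and stores. */
static int shared_data[N];

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        shared_data[i] = i * i;   /* plain stores into shared memory */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);        /* join orders the stores before the loads below */

    int sum = 0;
    for (int i = 0; i < N; i++)
        sum += shared_data[i];    /* plain loads from shared memory */
    printf("sum = %d\n", sum);
    return 0;
}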

Classic Organization of SMP – Diagram

Two styles of SMP


1. Uniform Memory Access (UMA) Multiprocessor
This multiprocessor takes about the same time to access main memory no matter which processor
requests it and no matter which word is requested. Such machines are called uniform memory access
(UMA) multiprocessors.
2. Non- Uniform Memory Access (NUMA) Multiprocessors
In this multiprocessor, some memory accesses are much faster than others, depending on which
processor asks for which word. Such machines are called non-uniform memory access (NUMA)
multiprocessors.
The programming challenges are harder for a NUMA multiprocessor than for a UMA multiprocessor,
but NUMA machines can scale to larger sizes and NUMAs can have lower latency to nearby memory.
Synchronization
It is the process of coordinating the behavior of two or more processes, which may be running on
different processors. As processors operating in parallel will normally share data, they also need to
coordinate when operating on shared data; otherwise, one processor could start working on data before
another is finished with it. This coordination is called synchronization.
Using Locks
It is a synchronization technique that allows access to data to only one processor at a time.
When sharing is supported with a single address space, there must be a separate mechanism for
synchronization. One approach uses a lock for a shared variable. Only one processor at a time can
acquire the lock, and the other processors interested in shared data must wait until the original
processor unlocks the variable.
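A minimal sketch of lock-based synchronization, assuming a POSIX system with pthreads (the shared counter, thread count, and iteration count are illustrative): several threads update a shared variable, and the mutex ensures that only one of them accesses it at a time.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                                  /* shared variable  */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* only one thread may hold the lock */
        counter++;                    /* safe update of the shared data    */
        pthread_mutex_unlock(&lock);  /* let another thread acquire it     */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}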
Message Passing Multiprocessors

An alternative approach to sharing an address space is for each processor to have its own
private address space. Such a multiprocessor must communicate via explicit message passing, which
traditionally gives this style of computer its name, provided the system has routines to send and
receive messages. Coordination is built in with message passing, since one processor knows when a
message is sent, and the receiving processor knows when a message arrives. If the sender needs
confirmation that the message has arrived, the receiving processor can then send an acknowledgement
message back to the sender.
Classic Organization of Multiprocessor with multiple private address space (or)
Message-Passing Multiprocessor – Diagram

Message passing
Message passing is nothing but communication between multiple processors by explicitly sending
and receiving information.
Send Message Routine
A routine used by a processor in machines with private memories to pass a message to another processor.
Receive Message Routine
A routine used by a processor in machines with private memories to accept a message from
another processor.
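As a sketch of these routines in practice (assuming an MPI installation is available; MPI is only one possible message-passing library, and the value sent here is illustrative), processor 0 sends an integer to processor 1, which receives it:

#include <mpi.h>
#include <stdio.h>

/* Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out */
int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which processor am I? */

    if (rank == 0) {
        value = 42;
        /* Send-message routine: pass one int to processor 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive-message routine: accept one int from processor 0. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("processor 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}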
Some concurrent applications run well on parallel hardware, independent of whether it offers
shared addresses or message passing. In particular, job-level parallelism and applications with little
communication – like web search, mail servers, and file servers – do not require shared addressing to run
well.
Advantages
There were several attempts to build high-performance computers based on high-performance
message-passing networks, and they did offer better absolute communication performance than clusters
built using local area networks.
Disadvantages
The problem was that they were much more expensive. Few applications could justify the higher
communication performance, given the much higher costs.
Example - Clusters
Clusters are collections of computers that are connected to each other using their I/O interconnect
via standard network switches and cables to form a message-passing multiprocessor. Each runs a distinct
copy of the operating system. Virtually every internet service relies on clusters of commodity servers and
switches.
Drawbacks of cluster
o Administration cost – The cost of administering a cluster of n machines is about the same as
the cost of administering n independent machines, while the cost of administering a shared
memory multiprocessor with n processors is about the same as administering a single machine.
o Performance degradation – The processors in a cluster are usually connected using the I/O
interconnect of each computer; whereas the cores in a multiprocessor are usually connected on
the memory interconnect of the computer. The memory interconnect has higher bandwidth and
lower latency, allowing much better communication performance.
o Division of memory – A cluster of n machines has n independent memories and n copies of the
operating system, but a shared memory multiprocessor allows a single program to use almost
all the memory in the computer, and it only needs a single copy of the OS.
Advantages of Clusters
1. High availability – Since a cluster consists of independent computers connected through a local
area network, it is much easier to replace a machine without bringing down the system in cluster
than in an SMP.
2. Scalable – Given that clusters are constructed from whole computers and independent, scalable
networks, this isolation also makes it easier to expand the system without bringing down the
application that runs on top of the cluster.
3. Low cost
4. Improved power efficiency – Clusters consume less power and work efficiently.
Examples
The search engines that millions of us use every day depend upon this technology. eBay,
Google, Microsoft, Yahoo, and others all have multiple datacenters each with clusters of tens of
thousands of processors.
