INTRODUCTION
To fulfill increasing demands for higher performance, it is necessary to process data concurrently to
achieve better throughput instead of processing each instruction sequentially as in a conventional
computer. Processing data concurrently is known as parallel processing. There are two ways by which we
can achieve parallelism. They are:
Multiple Functional Units - System may have two or more ALUs so that they can execute two or
more instructions at the same time.
Multiple Processors - System may have two or more processors operating concurrently.
Earlier processors had only one Arithmetic and Logic Unit (ALU) in the CPU. Furthermore, the
ALU could perform only one function at a time, making the execution of a long sequence of arithmetic
and logical instructions quite slow. Nowadays, processors are available with multiple functional units,
among which the work of the ALU can be distributed and carried out in parallel. This parallel execution
increases the speed of processing.
Parallel computing is a form of computation in which many calculations are carried out
simultaneously, operating on the principle that large problems can often be divided into smaller ones,
which are then solved concurrently in parallel.
There are several different forms of parallel computing: bit-level, instruction-level, data-level and
task-level parallelism. Parallelism has been employed for many years, mainly in high performance
computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling.
As power consumption and consequently heat generation by computers has become a concern in recent
years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form
of multi-core processors.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements within a
single machine, while clusters and grids use multiple computers to work on the same task. Specialized
parallel computer architectures are sometimes used alongside traditional processors for accelerating
specific tasks.
Parallel computer programs are more difficult to write than sequential ones, because concurrency
introduces several new classes of potential software bugs, of which race conditions are the most common
ones. Communication and synchronization between the different subtasks are typically some of the
greatest obstacles to getting good parallel program performance.
Important Definitions:
Multiprocessors
A computer system with two or more processors is called a multiprocessor system.
The multiprocessor software must be designed to work with a variable number of processors.
Replacing a few large, inefficient processors with many smaller, efficient processors can deliver
better performance per watt, if software can use them efficiently.
Multicore Multiprocessors
A multicore is an architecture design that places multiple processors on a single die
(computer chip) to enhance performance and allow simultaneous processing of multiple tasks
more efficiently. Each processor is called a core.
Multiprocessors have been shoved into the spotlight because the power problem means that
further increases in performance will apparently come from more processors per chip rather
than from higher clock rates and improved CPI. Such chips are called multicore multiprocessors
rather than multiprocessor microprocessors; to avoid redundancy in naming, the processors are
often called ‘cores’ in a multicore chip. The number of cores is expected to double every two years.
Thus, programmers who care about performance must become parallel programmers.
Challenges
The tall challenge facing the industry is to create hardware and software that will make it
easy to write correct parallel processing programs that will execute efficiently in performance and
power as the number of cores per chip scales geometrically.
INSTRUCTION LEVEL PARALLELISM
Instruction level parallelism is a kind of parallelism among instructions. It can exist when
instructions in a sequence are independent and thus can be executed in parallel by overlapping. It
refers to the degree to which, on average, the instructions of a program can be executed in parallel.
A combination of compiler based optimization and hardware techniques can be used to maximize
instruction level parallelism.
Two methods of increasing ILP
o Increasing the depth of pipeline
By increasing the depth of the pipeline, more instructions can be executed in
parallel simultaneously. The amount of parallelism being exploited is higher, since there
are more operations being overlapped. Performance is potentially greater since the clock
cycle can be shorter.
o Multiple Issue
Multiple issue is a technique which replicates the internal components of the
computer so that it can launch multiple instructions in every pipeline stage. Launching
multiple instructions per stage allows the instruction execution rate to exceed the clock
rate or, equivalently, the CPI to be less than 1.
Types of Multiple issues
There are two major ways to implement a multiple-issue processor:
Static multiple Issues – It is an approach to implement a multiple issue
processor where many decisions are made statically by the compiler before
execution.
Dynamic Multiple Issues – It is an approach to implement a multiple issue
processor where many decisions are made during execution by the
processor.
The major differences between these two kinds of issues are the division of
work between the compiler and the hardware, because the division of work dictates
whether decisions are made at compile time or during execution time.
VLIW
Since a static multiple-issue processor usually restricts what mix of instructions can be initiated in a
given clock cycle, it is useful to think of the issue packet as a single instruction allowing several
operations in certain predefined fields. This view led to the original name for this approach – Very Long
Instruction Word (VLIW).
VLIW is a style of instruction set architecture that launches many operations that are defined to be
independent in a single wide instruction, typically with many separate opcode fields.
Dealing with hazards
Compiler
Most static issue processors rely on the compiler to take on some responsibility for
handling data and control hazards. The compiler’s responsibilities may include static branch
prediction and code scheduling to reduce or prevent all hazards.
In some designs, the compiler takes full responsibility for removing all hazards, scheduling
the code and inserting no-ops so that the code executes without any need for hazard detection or
hardware-generated stalls.
Hardware
The hardware detects the data hazards and generates stalls between two issue packets,
while requiring that the compiler avoid all dependencies within an instruction pair. So, a hazard
generally forces the entire issue packet containing the dependent instruction to stall.
Whether the software must handle all hazards or only try to reduce the fraction of hazards
between separate issue packets, the appearance of having a large single instruction with multiple
operations is reinforced.
Static two-issue pipeline in operation
Reorder the instructions to avoid as many stalls as possible. Assume branches are
predicted, so that control hazards are handled by the hardware. The first three instructions
have data dependencies, and so do the last two.
In the scheduled code, as it would look on a two-issue MIPS pipeline, the empty issue slots are filled with ‘nop’ instructions.
Advantages
Reordering the instructions reduces the number of stalls and increases the performance
of the processor.
2. Loop Unrolling
An important compiler technique to get more performance from loops is loop unrolling,
where multiple copies of the loop body are made. After unrolling, there is more ILP available by
overlapping the instructions from different iterations.
Loop unrolling is a technique to get more performance from loops that access arrays, in
which multiple copies of the loop body are made and instructions from different iterations are
scheduled together.
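The transformation can be sketched in plain Python (the text's actual example uses MIPS assembly; the function names here are invented for illustration). The unrolled version assumes the array length is a multiple of 4, as the text does:

```python
# 4x loop unrolling, sketched in plain Python.

def scale_add(a, x, y):
    # original loop: one element of y is updated per iteration
    for i in range(len(x)):
        y[i] = y[i] + a * x[i]
    return y

def scale_add_unrolled(a, x, y):
    # unrolled by 4: four copies of the loop body per iteration, exposing
    # four independent operations that a compiler can schedule together,
    # while paying the loop overhead (index update, branch) only once
    for i in range(0, len(x), 4):
        y[i]     = y[i]     + a * x[i]
        y[i + 1] = y[i + 1] + a * x[i + 1]
        y[i + 2] = y[i + 2] + a * x[i + 2]
        y[i + 3] = y[i + 3] + a * x[i + 3]
    return y
```

Both versions compute the same result; the point of unrolling is that the four statements inside the unrolled body are independent of one another and can be overlapped.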
Example
Let us see how well loop unrolling and scheduling work in the above example. For
simplicity, assume that the loop index is a multiple of 4.
To schedule the loop without any delays, it turns out that we need to make 4 copies of the
loop body. After unrolling and eliminating the unnecessary loop overhead instructions, the
loop will contain four copies of lw, addu, sw, addi and bne.
During the unrolling process, the compiler introduced additional registers ($t1, $t2,
$t3). The goal of this process, called register renaming, is to eliminate dependences that are
not true data dependences, but could either lead to potential hazards or prevent the compiler
from flexibly scheduling the code.
Register Renaming
It is the process of renaming the registers by the compiler or hardware to remove
antidependences.
Consider how the unrolled code would look using only $t0. There would be
repeated instances of lw $t0, 0($s1), addu $t0, $t0, $s2 followed by sw $t0, 4($s1), but
these sequences, despite using $t0, are actually completely independent – no data values
flow between one pair of these instructions and the next pair. This is what is called an
antidependence, or name dependence, which is an ordering forced purely by the reuse of a
name, rather than a real data dependence, which is also called a true dependence.
Name Dependence /Antidependence
It is an ordering forced by the reuse of a name, typically a register, rather than by a true
dependence that carries a value between two instructions.
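The idea can be sketched with a toy renamer (an illustration only, not how a real compiler or hardware tracks registers): each new definition of a register receives a fresh name, so only the true read-after-write dependences remain.

```python
# Toy register renamer: removes antidependences caused by name reuse.

def rename(instrs):
    # instrs: list of (destination, sources) pairs using architectural names
    count = {}     # how many times each register has been defined so far
    current = {}   # architectural name -> latest renamed name
    out = []
    for dst, srcs in instrs:
        new_srcs = tuple(current.get(s, s) for s in srcs)  # read latest versions
        count[dst] = count.get(dst, 0) + 1
        current[dst] = dst + "_" + str(count[dst])          # fresh name per definition
        out.append((current[dst], new_srcs))
    return out

# two independent lw/addu pairs, both written in terms of $t0 as in the text:
code = [("$t0", ("0($s1)",)),      # lw   $t0, 0($s1)
        ("$t0", ("$t0", "$s2")),   # addu $t0, $t0, $s2
        ("$t0", ("4($s1)",)),      # lw   $t0, 4($s1)
        ("$t0", ("$t0", "$s2"))]   # addu $t0, $t0, $s2
renamed = rename(code)
```

After renaming, the first pair uses $t0_1/$t0_2 and the second uses $t0_3/$t0_4; since no name is shared between them, a scheduler is free to interleave the two pairs.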
Advantages
Renaming the registers during the unrolling process allows the compiler to subsequently move
these independent instructions so as to better schedule the code. The renaming process
eliminates the name dependences while preserving the true dependences.
Loop unrolling and scheduling with dual issue gave us an improvement factor of almost 2,
partly from reducing the loop control instructions and partly from dual issue execution. The cost of
this performance improvement is using four temporary registers rather than one, as well as a
significant increase in code size.
Register Renaming
The combination of buffering operands in the reservation stations and results in the reorder
buffer provides a form of register renaming, just like that used by the compiler in loop unrolling.
To see how this conceptually works, consider the following steps:
1. When an instruction issues, it is copied to a reservation station for the appropriate
functional unit. Any operands that are available in the register file or reorder buffer
are also immediately copied into the reservation station. The instruction is buffered
in the reservation station until all the operands and the functional units are
available. For the issuing instruction, the register copy of the operand is no longer
required, and if a write to that register occurred, the value could be overwritten.
2. If an operand is not in the register file or reorder buffer, it must be waiting to be
produced by a functional unit. The name of the functional unit that will produce the
result is tracked. When that unit eventually produces the result, it is copied directly
into the waiting reservation station from the functional unit bypassing the registers.
These steps effectively use the reorder buffer and the reservation stations to implement
register renaming.
Out- of- Order Execution
A dynamically scheduled pipeline can be used for analyzing the data flow structure of a program.
The processor then executes the instructions in some order that preserves the data flow order of the
program. This style of execution is called an out-of-order execution, since the instructions can be executed
in a different order than they were fetched.
Out-of-order execution is a situation in pipelined execution when an instruction blocked from
executing does not cause the following instructions to wait.
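A toy model of this idea (a sketch only, not any real pipeline; antidependences are ignored, since register renaming would remove them): instructions are listed in fetch order, but any instruction whose source values have already been produced may execute next.

```python
# Toy out-of-order execution: pick any ready instruction, preserving data flow.

def dataflow_order(instrs):
    # instrs: list of (dst, srcs); returns one legal execution order (indices)
    producer = {}
    deps = []
    for i, (dst, srcs) in enumerate(instrs):
        # an instruction depends on whichever earlier instruction wrote its sources
        deps.append({producer[s] for s in srcs if s in producer})
        producer[dst] = i
    done, order = set(), []
    while len(order) < len(instrs):
        # greedily pick the latest-indexed ready instruction, to make the
        # reordering visible
        for i in reversed(range(len(instrs))):
            if i not in done and deps[i] <= done:
                done.add(i)
                order.append(i)
                break
    return order

# instruction 1 waits on instruction 0, but instruction 2 is independent
# and may execute first:
order = dataflow_order([("a", ()), ("b", ("a",)), ("c", ())])
```

Here the blocked instruction ("b", which needs "a") does not hold up the independent instruction "c", which executes out of fetch order.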
In-Order Commit
In-order commit is a commit in which the results of pipelined execution are written to the
programmer-visible state in the same order that instructions are fetched.
To make programs behave as if they were running on a simple in-order pipeline, the instruction
fetch and decode unit is required to issue instructions in order, which allows dependences to be tracked,
and the commit unit is required to write results to registers and memory in program fetch order. This
conservative mode is called in-order commit.
If any exception occurs, the computer can point to the last instruction executed, and the only
registers updated will be those written by instructions before the instruction causing the exception.
Although the front end (fetch and issue) and the back end (commit) of the pipeline run in order, the
functional units are free to initiate execution whenever the data they need is available. Today, all
dynamically scheduled pipelines use in-order commit.
Advantages of Dynamic Scheduling
1. Dynamic scheduling is often extended by including hardware-based speculation, especially for
branch outcomes. By predicting the direction of a branch, a dynamically scheduled processor can
continue to fetch and execute instructions along the predicted path. Because the instructions are
committed in order, we know whether or not the branch was correctly predicted before any
instructions from the predicted path are committed. A speculative, dynamically scheduled pipeline
can also support speculation on load addresses, allowing load-store reordering, and using the
commit unit to avoid incorrect speculation.
2. Not all stalls are predictable; in particular, cache misses can cause unpredictable stalls. Dynamic
scheduling allows the processor to hide some of those stalls by continuing to execute instructions
while waiting for the stall to end.
3. If the processor speculates on branch outcomes using dynamic branch prediction, it cannot know
the exact order of instructions at compile time, since it depends on the predicted and actual
behavior of branches.
4. As the pipeline latency and issue width change from one implementation to another, the best way
to compile a code sequence also changes.
5. Old code will get much of the benefit of a new implementation without the need for recompilation.
General Conclusion:
Both pipelining and multiple-issue execution increase peak instruction throughput and attempt to
exploit instruction-level parallelism (ILP). Data and control dependences in programs offer an upper limit
on sustained performance because the processor must sometimes wait for a dependence to be resolved.
Software-centric approaches to exploiting ILP rely on the ability of the compiler to find and reduce
the effects of such dependences, while hardware-centric approaches rely on extensions to the pipeline and
issue mechanisms. Speculation performed by the compiler or the hardware can increase the amount of
ILP that can be exploited, although care must be taken since speculating incorrectly is likely to reduce
performance.
PARALLEL PROCESSING CHALLENGES
Parallel processing increases the performance of the processor and reduces the time needed to
execute a task. The difficulty with parallelism is not the hardware; it is that too few important application
programs have been rewritten to complete tasks sooner on multiprocessors.
It is difficult to write software that uses multiple processors to complete one task faster, and the
problem gets worse as the number of processors increases.
Difficulty in Developing Parallel Processing programs
Developing parallel processing programs is harder than developing sequential programs for the
following reasons: the work must be partitioned and scheduled among the processors, the load must be
balanced, and the time for synchronization and the overhead for communication must be kept small.
In addition, Amdahl's law limits the speed-up obtainable from an enhancement:
Speedup = (Execution time for the task without the enhancement) / (Execution time for the task with the enhancement)

or Speedup = (Performance with the enhancement) / (Performance without the enhancement)
Fractionenhanced (Fe)
It is the fraction of the computation time in the original machine that can be
converted to take advantage of the enhancement. For example, if CPU’s I/O section is
enhanced and it is assumed that the CPU is busy 60% of the time in I/O operations, then
Fe = 0.6. The fraction enhanced is always less than or equal to 1.
Speedupenhanced (Se)
It tells how much faster the task would run if the enhancement mode was used for
the entire program. For example, if the CPU’s I/O section is made 10 times faster, then
Se = 10. The speed-up of the enhanced part is always greater than 1.
Amdahl’s law gives us a quick way to find the speed-up from these two factors,
Fractionenhanced (Fe) and Speedupenhanced (Se). It is given as

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

Therefore, Speedup = (Execution time before improvement) / (Execution time after improvement)

Speedup = 1 / ((1 - Fe) + Fe / Se)
Problems related to Amdahl’s Law:
1. Suppose you want to achieve a speed-up of 80 with 100 processors. What percentage
of the original computation can be sequential?
Solution:
Given: Speedup = 80, Speedupenhanced Se = 100, Fe = ?
Amdahl’s law says that

Execution time after improvement = (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

We can reformulate Amdahl’s law in terms of speed-up versus the original execution time.
This formula is usually rewritten assuming that the execution time before is 1 for some unit of
time, and the execution time affected by improvement is the fraction Fe of the original
execution time:

Speedup = 1 / ((1 - Fe) + Fe / Se)

So we have

80 = 1 / ((1 - Fe) + Fe / 100)

Solving for Fe:

(1 - Fe) + Fe / 100 = 1 / 80 = 0.0125
1 - 0.99 Fe = 0.0125
Fe = 0.9875 / 0.99 ≈ 0.9975

Thus, to achieve a speed-up of 80 from 100 processors, the sequential percentage (1 - Fe) can be at most about 0.25%.
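The worked problem can be checked numerically; the helper names below are illustrative, not part of the original example:

```python
# Amdahl's law as a small helper, plus the inverse used in the problem above.

def speedup(fe, se):
    # Fe = fraction of execution time that benefits, Se = its speed-up
    return 1.0 / ((1.0 - fe) + fe / se)

def max_enhanced_fraction(target, se):
    # solve target = 1 / ((1 - Fe) + Fe / Se) for Fe
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / se)

fe = max_enhanced_fraction(80, 100)  # fraction that must be parallelizable
sequential = 1.0 - fe                # the remaining sequential fraction
```

Plugging `fe` back into `speedup` recovers the target of 80, and `sequential` comes out near 0.0025, i.e. roughly a quarter of one percent.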
2. Suppose you want to perform the 1600 additions needed to sum a 40 x 40 matrix on 50
processors. What is the speed-up (a) if one processor carries twice the balanced load and
(b) if one processor has 10% of the whole load?
Solution:
a) With a perfectly balanced load, each of the 50 processors would perform 1600 / 50 = 32
additions, taking time 32t. If one processor instead performs twice that, it must perform 64
additions, while the remaining 1536 additions are divided among the other 49 processors.
The sum takes as long as the hardest-working processor, i.e. 64t.
Thus, we can say that the remaining 49 processors are utilized less than half the time as
compared to 64t for the hardest-working processor.
b) If one processor has 10% of the load, it must perform 10% x 1600 or 160 additions, while
the remaining 1440 additions are divided among the other 49 processors.
Thus, we can say that the remaining 49 processors are utilized less than 20% of the time as
compared to 160t for the hardest-working processor.
This example demonstrates the value of balancing the load, for just a single processor with twice
the load of the others cuts speed-up almost in half, and five times the load on one processor reduces the
speed-up by almost a factor of five.
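The arithmetic behind this example can be sketched as follows (function and variable names are invented for illustration):

```python
# Load-balancing arithmetic: the sum finishes only when the most heavily
# loaded processor does.

def speedup_with_imbalance(total_ops, processors, heavy_ops):
    rest = (total_ops - heavy_ops) / (processors - 1)  # ops per other processor
    time_parallel = max(heavy_ops, rest)               # slowest processor wins
    return total_ops / time_parallel

balanced = speedup_with_imbalance(1600, 50, 32)   # perfectly balanced: 32 each
twice = speedup_with_imbalance(1600, 50, 64)      # one processor has 2x load
five_x = speedup_with_imbalance(1600, 50, 160)    # one processor has 10% of all work
```

With a balanced load the speed-up is the full 50; doubling one processor's share cuts it to 25, and giving one processor 10% of the work cuts it to 10, matching the factor-of-two and factor-of-five reductions noted above.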
FLYNN’S CLASSIFICATION
Parallel processing can be classified in many ways. It can be classified according to the internal
organization of processors, according to the interconnection structure used between processors or
according to the flow of information through the system.
One such classification was introduced by Michael J. Flynn. We know that a typical processing unit
operates by fetching instructions and operands from the main memory, executing the instructions, and
placing the results in the main memory. The steps associated with the processing of an instruction form an
instruction cycle. The instruction can be viewed as forming an instruction stream flowing from main
memory to the processor, while the operands form another stream, data stream, flowing to and from the
processor.
[Figure: the instruction stream flows from main memory (M) to the processor (P), while the data stream flows between the processor and memory.]
In 1966, Michael J. Flynn made an informal and widely used classification of processor
parallelism based on the number of simultaneous instruction and data streams seen by the processor during
program execution.
The classification made by Michael J. Flynn divides computers into four major groups:
Single Instruction Stream – Single Data Stream (SISD)
Single Instruction Stream – Multiple Data Stream (SIMD)
Multiple Instruction Stream – Single Data Stream (MISD)
Multiple Instruction Stream – Multiple Data Stream (MIMD)
Categorization based on No. of instruction streams & No. of Data streams
This classification is based on the number of instruction streams and the number of data
streams. Thus, a conventional uniprocessor has a single instruction stream and a single data stream, and a
conventional multiprocessor has multiple instruction streams and multiple data streams.
If you imagine a pipeline in which fetching operands is separate from and follows instruction
decoding, then a PE is the part of a CPU that implements all the stages after instruction decoding, while a
control unit is the part of a CPU that implements all the stages up to instruction decoding. An SIMD
computer connects each control unit not to one PE, but to many PEs.
An application is data parallel if it wants to do the same computation on lots of pieces of data,
which typically come from different squares in a grid. Examples include image processing, weather
forecasting, and computational fluid dynamics (e.g. simulating airflow around a car or inside a jet engine).
SIMD machines cannot use commodity microprocessors, one reason being that it would be very
difficult to modify these to broadcast their control signals to a multitude of processing elements. The
companies that design SIMD machines have all designed their own processing elements and control units.
The processing elements are usually slower than ordinary microprocessors, but they are also much
smaller, which makes it possible to put several on a single chip.
Since the CPUs are nonstandard, SIMD machines need their own compilers and other system
software. The costs of designing the CPU and this system software add significantly to the up-front
investment required for the machine. Due to the multi-million dollar price tags of SIMD machines, this
investment has to be recovered from a relatively small number of customers, so each customer's share of
the development cost is quite high.
SIMD machines were reasonably popular in the late 1980s; at least as popular as machines with
multi-million dollar price tags could be. However, the difficulty of programming them and their
specialized nature (their price/performance is abysmal for any job that is not data parallel) led to the
demise of the companies that designed and sold them. However, the idea survives in a dramatically
scaled-down form, in the multimedia instructions added to most instruction sets during the middle to late
1990s.
Providing more than one arithmetic logic unit (ALU) that can all operate in parallel on different
inputs, providing the same operation, is an example of SIMD. This can be achieved by using multiple
input buses in the CPU for each ALU that load data from multiple registers. The processor's control unit
sends the same command to each of the ALUs to process the data and the results may be stored, again
using multiple output buses. Machines that provide vector operations are classified as SIMD. In this case a
single instruction is simultaneously applied to a vector.
For vector machines, the size of the vector is proportional to the parallelism. This is an example of
spatial parallelism. Pipelining exploits temporal parallelism within a single instruction stream. More
pipeline stages generally lead to more parallelism, to a limit.
Advantages of SIMD
It amortizes the cost of the control unit over dozens of execution units.
It has reduced instruction bandwidth and program memory.
It needs only one copy of the code that is being executed simultaneously.
SIMD works best when dealing with arrays in ‘for’ loops. Hence, for parallelism to work in SIMD,
there must be a great deal of identically structured data, which is called data-level parallelism.
Disadvantages of SIMD
SIMD is at its weakest in case or switch statements, where each execution unit must perform a
different operation on its data, depending on what data it has.
Execution units with the wrong data are disabled, so that units with proper data may continue.
Such situations essentially run at 1/nth of peak performance, where ‘n’ is the number of cases.
Variations of SIMD
SIMD in x86 – Multimedia Extensions
The most widely used variation of SIMD is found in almost every microprocessor today, and is the
basis of the hundreds of MMX and SSE instructions of the x86 microprocessor. They were added to
improve performance of multimedia programs. These instructions allow the hardware to have many ALUs
operate simultaneously or, equivalently, to partition a single, wide ALU into many parallel smaller ALUs
that operate simultaneously.
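This subword trick can be sketched in plain Python (a conceptual illustration of MMX-style lanes, not how the hardware is built): four 8-bit lanes are packed into one 32-bit word and added with a single wide integer addition, with masking keeping carries from crossing lane boundaries.

```python
# SWAR (SIMD-within-a-register) byte addition: four 8-bit lanes in one
# 32-bit word, each lane wrapping mod 256 independently.

HI = 0x80808080   # high bit of every 8-bit lane
LO7 = 0x7F7F7F7F  # low 7 bits of every lane

def pack(lanes):
    # place four byte values into one 32-bit word, lane 0 in the low byte
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFF) << (8 * i)
    return word

def unpack(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def padd_bytes(a, b):
    # add the low 7 bits of each lane (no carry can escape a lane),
    # then restore the high bits of each lane with an XOR
    low = (a & LO7) + (b & LO7)
    return (low ^ ((a ^ b) & HI)) & 0xFFFFFFFF
```

One 32-bit addition thus performs four independent 8-bit additions, which is exactly the kind of narrow-data parallelism the multimedia extensions exploit.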
This very low cost parallelism for narrow integer data was the original inspiration of the MMX
instructions of the x86. As more hardware and instructions were added, these multimedia extensions
evolved into the Streaming SIMD Extensions (SSE) and, more recently, the Advanced Vector
Extensions (AVX).
AVX supports the simultaneous execution of four 64-bit floating-point numbers. The width of
the operation and the registers is encoded in the opcode of these multimedia instructions. As the data
width of the registers and operations grew, the number of opcodes for multimedia instructions exploded,
and now there are hundreds of SSE instructions to perform the useful combinations.
Vector Processors
An older and more elegant interpretation of SIMD is called a vector architecture, which has been
closely identified with Cray computers. It is again a great match to problems with lots of data-level
parallelism. Rather than having 64 ALUs perform 64 additions simultaneously, like the old array
processors, the vector architectures pipelined the ALU to get good performance at lower cost.
The basic philosophy of vector architecture is to collect data elements from memory, put them in
order into a large set of registers, operate on them sequentially in registers, and then write the results back
to memory. A key feature of vector architectures is a set of vector registers. Thus, vector architecture
might have 32 vector registers, each with 64 64-bit elements.
Vector elements are independent and can be operated on in parallel. All modern vector
computers have vector functional units with multiple parallel pipelines called vector lanes. A vector
functional unit with parallel pipelines produces two or more results per clock cycle.
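A hypothetical timing sketch shows why lanes multiply throughput (the 4-cycle start-up is an assumed value, not a figure from the text):

```python
# Toy timing model of a pipelined vector functional unit: after a fixed
# pipeline-fill time, `lanes` parallel pipelines retire `lanes` results
# per clock.

def vector_op_cycles(n, lanes, startup=4):
    # total cycles = pipeline fill time + ceil(n / lanes) issue cycles
    return startup + -(-n // lanes)

one_lane = vector_op_cycles(64, 1)    # a 64-element vector on one pipeline
four_lanes = vector_op_cycles(64, 4)  # the same vector with four lanes
```

The start-up cost is paid once per vector operation rather than once per element, which is why the stall behavior described for vector processors below is so favorable.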
Advantages of vector processors
A vector processor greatly reduces the dynamic instruction bandwidth, executing only six
instructions versus almost 600 for MIPS.
The reduction in instructions fetched and executed, saves power.
Frequency of occurrence of pipeline hazards is reduced.
On the vector processor, each vector instruction will only stall for the first element in each vector,
and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are
required only once per vector operation, rather than once per vector element.
The pipeline stalls can be reduced on MIPS by using loop-unrolling.
Vector vs Scalar
Vector instructions have several important properties compared to conventional instruction set
architectures, which are called scalar architectures in this context:
A single vector instruction specifies a great deal of work – it is equivalent to executing an entire
loop. The instruction fetch and decode bandwidth needed is dramatically reduced.
By using a vector instruction, the compiler or programmer indicates that the computation of each
result in the vector is independent of the computation of other results in the same vector, so
hardware does not have to check for data hazards within a vector instruction.
Vector architectures and compilers have a reputation of making it much easier than MIMD
multiprocessors to write efficient applications when they contain data-level parallelism.
Hardware need only check for data hazards between two vector instructions once per vector
operand, not once for every element within the vectors. Reduced checking can save power as well.
Vector instructions that access memory have a known access pattern. If the vector’s elements are
all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very
well. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather
than once for each word of the vector.
Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control
hazards that would normally arise from the loop branch are nonexistent.
The savings in instruction bandwidth and hazard checking plus the efficient use of memory
bandwidth give vector architectures advantages in power and energy versus scalar architectures.
For these reasons, vector operations can be made faster than a sequence of scalar operations on the
same number of data items, and designers are motivated to include vector units if the application domain
can use them frequently.
Vector vs Multimedia Extensions
Like multimedia extensions found in the x86 SSE instructions, a vector instruction specifies
multiple operations. However, multimedia extensions typically specify a few operations while
vector specifies dozens of operations.
Unlike multimedia extensions, the number of elements in a vector operation is not in the opcode
but in a separate register. This means different versions of the vector architecture can be
implemented with a different number of elements just by changing the contents of that register and
hence retain binary compatibility. In contrast, a new large set of opcodes is added each time the
‘vector’ length changes in the multimedia extension architecture of the x86.
Unlike multimedia extensions, the data transfers need not be contiguous. Vector architectures support
both strided accesses, where the hardware loads every nth data element in memory, and indexed
accesses, where the hardware finds the addresses of the items to be loaded in a vector register.
Like multimedia extensions, vector easily captures the flexibility in data widths, so it is easy to
make an operation work on 32 64-bit data elements or 64 32-bit data elements or 128 16-bit data
elements or 256 8-bit data elements.
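The strided and indexed access patterns mentioned above can be sketched with memory modeled as a Python list (the function names are illustrative, not a real ISA):

```python
# Strided and indexed (gather) vector loads, modeled over a flat memory array.

def vload_strided(mem, base, stride, vl):
    # hardware loads every stride-th element starting at base, vl elements total
    return [mem[base + i * stride] for i in range(vl)]

def vload_indexed(mem, base, indices):
    # gather: the addresses to load come from an index vector register
    return [mem[base + j] for j in indices]

mem = list(range(100))  # toy memory whose contents equal their addresses
```

A strided load with stride 1 is the contiguous case; larger strides handle column accesses of a matrix, and indexed loads handle arbitrary scatter/gather patterns.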
Sl. No.  Vector Architecture                                 Multimedia Extensions
1        It specifies dozens of operations                   It specifies a few operations
2        Number of elements in a vector operation is held    Number of elements in a multimedia
         in a separate register, not in the opcode           extension operation is in the opcode
3        In vectors, data transfers need not be contiguous   In multimedia extensions, data transfers
                                                             need to be contiguous
4        It specifies multiple operations                    It also specifies multiple operations
5        It easily captures the flexibility in data widths   It also easily captures the flexibility
                                                             in data widths
6        It is easier to evolve over time                    It is complex to evolve over time
Generally, vector architectures are a very efficient way to execute data parallel processing programs;
they are better matches to compiler technology than multimedia extensions; and they are easier to evolve
over time than the multimedia extensions to the x86 architecture.
The use of standard components is important because it keeps down the costs of the company
designing the multiprocessor; the development cost of the standard components is spread out over a much
larger number of customers.
In theory, the interconnection network can be something other than a bus. However, for cache
coherence, you need an interconnection network in which each processor sees the traffic between every
other processor and memory, and all such interconnection networks are either buses or have components
which are equivalent to buses. Low-end and midrange multiprocessors use buses; some high-end
multiprocessors use multiple bus systems, or crossbars with broadcast as well as point-to-point capability.
With the MIMD organization, the processors are general purpose and each is able to process all of
the instructions necessary to perform the appropriate data transformation.
It can be divided into two types:
1. Shared Memory
2. Distributed Memory
Shared Memory Architecture
If the processors share a common memory then each processor accesses programs and data stored
in the shared memory and processors communicate with each other via that memory.
HARDWARE MULTITHREADING
Important Terms used in Multithreaded Processors
Multithreading
Multithreading is a higher-level parallelism called thread-level parallelism (TLP) because it is
logically structured as separate threads of execution.
When pipelining is used, it is essential to maximize the utilization of each pipeline stage to
improve throughput. This can be accomplished by executing some instructions in a different order from
the one in which they occur in the instruction stream, and by initiating the execution of some instructions
speculatively, even though they may not be required. However, this approach needs more complex
mechanisms in the design, and the designer cannot cross the limitations of circuit complexity and power
consumption. Therefore, another approach, called multithreading, is used.
In multithreading, the instruction stream is divided into several smaller streams, called threads,
such that the threads can be executed in parallel. Here, a high degree of instruction-level parallelism can
be achieved without increasing the circuit complexity or power consumption.
Process
A process is an instance of a program running on a computer. The process image is the collection of program, data, stack, and attributes that define the process; it is stored in a virtual address space. There are two important characteristics of a process:
1. Resource ownership – A process may be allocated resources such as main memory, I/O
channels, I/O devices and files from time to time.
2. Scheduling / Execution – A process executes through one or more programs, and this
execution may be interleaved with that of other processes. The operating system manages the
execution state of each process (running, ready, and so on) and its dispatching priority.
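A minimal sketch of the process abstraction, assuming Python's multiprocessing module: the child process gets its own copy of the process image, so its writes to "its" variables are invisible to the parent.

```python
# Illustrative sketch: a child process owns a separate address space,
# so changes it makes to its copy of the data are not seen by the parent.
import multiprocessing as mp

value = 10

def child():
    global value
    value = 99        # modifies the CHILD's copy only

if __name__ == "__main__":
    p = mp.Process(target=child)
    p.start()
    p.join()
    print(value)      # still 10 in the parent: separate address spaces
```

Contrast this with the threads of a single process, which share one address space.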
Process Switch
A process switch is an operation that transfers control from one process to another. It first saves all the control data, registers, and other information of the running process, and then loads the corresponding information for the second process.
Thread
A thread is a lightweight process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program on its own. A thread includes the program counter, the stack pointer, and its own stack area. It executes sequentially and can be interrupted so that control transfers to another thread.
Thread Switch
A thread switch is an operation that switches the control from one thread to another within the
same process. This is cheaper than a process switch.
Explicit threads
Both user-level threads, which are visible to the application program, and kernel-level threads, which are visible only to the operating system, are referred to as explicit threads.
Implicit and Explicit Multithreading
Implicit Multithreading refers to the concurrent execution of multiple threads extracted from a
single sequential program.
Explicit Multithreading refers to the concurrent execution of instructions from different explicit
threads, either by interleaving instructions from different threads on shared pipelines or by parallel
execution on parallel pipelines.
Thread – Level Parallelism
Unlike instruction-level parallelism, which exploits implicit parallel operations within a loop or
straight-line code segment, thread-level parallelism is explicitly represented by the use of multiple threads
of execution that are inherently parallel.
Thread-level parallelism is an important alternative to instruction-level parallelism primarily
because it could be more cost-effective to exploit than instruction-level parallelism. There are many
important applications where thread-level parallelism occurs naturally, as it does in many server
applications.
Hardware Multithreading
Hardware multithreading allows multiple threads to share the functional units of a single processor
in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of
each thread. For example, each thread would have a separate copy of the register file and the PC. The
memory itself can be shared through the virtual memory mechanisms, which already support
multiprogramming.
In addition, the hardware must support the ability to change to a different thread relatively quickly.
In particular, a thread switch should be much more efficient than a process switch, which typically
requires hundreds to thousands of processor cycles while a thread switch can be instantaneous.
Different Approaches to H/W Multithreading
There are two main approaches to hardware multithreading. They are
1. Fine-grained multithreading
2. Coarse-grained multithreading
Fine-Grained Multithreading
Fine-grained multithreading is a version of hardware multithreading in which the processor switches between threads after every instruction, resulting in interleaved execution of multiple threads.
The processor executes two or more threads at a time. It switches from one thread to another at
each clock cycle. During execution, if a thread is blocked because of data dependencies or memory
latencies, then that thread is skipped and a ready-thread is executed.
This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that
time. To make fine-grained multithreading practical, the processor must be able to switch threads on every
clock cycle.
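The round-robin policy described above can be sketched as a toy simulator (a hypothetical model, mine, not from the source). Each thread is a list of (name, stall-cycles) instruction pairs; a thread that is stalled is skipped until it becomes ready again:

```python
# Hypothetical sketch: fine-grained multithreading as per-cycle
# round-robin interleaving that skips stalled threads.
def fine_grained(threads):
    remaining = [list(t) for t in threads]   # each t: (name, stall) pairs
    ready_at = [0] * len(threads)            # cycle each thread is ready
    trace, cycle, rr = [], 0, 0
    while any(remaining):
        for k in range(len(threads)):        # round-robin scan
            t = (rr + k) % len(threads)
            if remaining[t] and ready_at[t] <= cycle:
                name, stall = remaining[t].pop(0)
                trace.append((cycle, t, name))
                ready_at[t] = cycle + 1 + stall  # stall blocks this thread
                rr = t + 1                       # next cycle, next thread
                break
        cycle += 1
    return trace

t0 = [("A0", 0), ("A1", 3), ("A2", 0)]       # A1 misses: 3-cycle stall
t1 = [("B0", 0), ("B1", 0), ("B2", 0)]
for c, t, name in fine_grained([t0, t1]):
    print(c, f"T{t}", name)
```

While thread 0 is stalled after A1, thread 1 keeps issuing (B1, B2), so only one cycle is left fully idle.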
Advantage
One key advantage of fine-grained multithreading is that it can hide the throughput losses
that arise from both short and long stalls, since instructions from other threads can be executed
when one thread stalls.
Disadvantage
The primary disadvantage of fine-grained multithreading is that it slows down the
execution of the individual threads, since a thread that is ready to execute without stalls will be
delayed by instructions from other threads.
Coarse-Grained Multithreading
Coarse-grained multithreading is a version of hardware multithreading in which the processor switches between threads only after significant events; that is, threads are switched only on costly stalls, such as second-level cache misses.
The processor executes instructions of a thread sequentially and if an event that causes any delay
occurs, it switches to another thread.
This change relieves the need to have thread switching be essentially free and is much less likely to
slow down the execution of an individual thread, since instructions from other threads will only be issued
when a thread encounters a costly stall.
Advantage
Coarse-grained multithreading is most useful for reducing the penalty of high-cost
stalls, where the pipeline refill time is negligible compared to the stall time.
It relieves the need for very fast thread switching.
It does not slow down the execution of an individual thread, since instructions from other
threads are issued only when the running thread encounters a costly stall.
Disadvantage
Coarse-grained multithreading is limited in its ability to overcome throughput losses,
especially from shorter stalls, due to pipeline start-up costs.
Since a processor with coarse-grained multithreading issues instructions from a single
thread, when a stall occurs, the pipeline must be emptied or frozen.
The new thread that begins executing after the stall must fill the pipeline before instructions
will be able to complete.
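The switch-on-costly-stall policy can likewise be sketched as a toy model (hypothetical, mine; the 2-cycle refill penalty is an assumed value, and for simplicity the stalled thread's miss is assumed resolved by the time control returns to it):

```python
# Hypothetical sketch: coarse-grained multithreading. One thread runs
# until it hits a costly stall, then the processor switches; the incoming
# thread pays a fixed pipeline-refill penalty before it can issue.
REFILL = 2                                   # assumed start-up cost (cycles)

def coarse_grained(threads):
    remaining = [list(t) for t in threads]   # each t: (name, stall) pairs
    cur, cycle, trace = 0, 0, []
    while any(remaining):
        if not remaining[cur]:               # current thread finished
            cur = (cur + 1) % len(threads)
            continue
        name, stall = remaining[cur].pop(0)
        trace.append((cycle, cur, name))
        cycle += 1
        if stall:                            # costly stall: switch threads
            cur = (cur + 1) % len(threads)
            cycle += REFILL                  # new thread refills the pipeline
    return trace

t0 = [("A0", 0), ("A1", 3), ("A2", 0)]       # A1 misses in the cache
t1 = [("B0", 0), ("B1", 0)]
for c, t, name in coarse_grained([t0, t1]):
    print(c, f"T{t}", name)
```

Note the gap at cycles 2-3: the refill after the switch is exactly the start-up cost the text describes, which is why this scheme only pays off on long stalls.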
Simultaneous Multithreading (SMT)
Simultaneous multithreading is a version of hardware multithreading that lowers the cost of multithreading by utilizing the resources of a multiple-issue, dynamically scheduled microarchitecture. The issue slots of a wide superscalar processor are filled by executing multiple threads simultaneously on its multiple execution units.
It is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically
scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level
parallelism. The key insight that motivates SMT is that multiple-issue processors often have more
functional unit parallelism available than a single thread can effectively use.
Advantages
Simultaneous Multithreaded Architecture is superior in performance to a multiple-issue
multiprocessor (multiple-issue CMP).
SMT boosts utilization by dynamically scheduling functional units among multiple threads.
SMT also increases hardware design flexibility.
A drawback, however, is that SMT increases the complexity of instruction scheduling.
With register renaming and dynamic scheduling, multiple instructions from independent threads
can be issued without regard to the dependences among them; the resolution of the dependences
can be handled by the dynamic scheduling capability.
Since you are relying on the existing dynamic mechanisms, SMT does not switch resources every
cycle. Instead, SMT is always executing instructions from multiple threads, leaving it up to the
hardware to associate instruction slots and renamed registers with their proper threads.
The following figure illustrates the differences in a processor’s ability to exploit superscalar
resources for the following processor configurations. The top portion shows how four threads would
execute independently on a superscalar with no multithreading support. The bottom portion shows how
the four threads could be combined to execute on the processor more efficiently using three multithreading
options:
A superscalar with coarse-grained multithreading
A superscalar with fine-grained multithreading
A superscalar with simultaneous multithreading
In the above diagram, the horizontal dimension represents the instruction issue capability in each
clock cycle. Vertical dimension represents a sequence of clock cycles. Empty slots indicate that the
corresponding issue slots are unused in that clock cycle.
In a superscalar without hardware multithreading support, the use of issue slots is limited by a
lack of instruction-level parallelism. In addition, a major stall, such as an instruction cache miss, can leave
the entire processor idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor. Although this reduces the number of
completely idle cycles, the pipeline start-up overhead still leads to idle cycles, and limitations in ILP
mean that not all issue slots will be used. Furthermore, since thread switching occurs only when there is a
stall and the new thread has a start-up period, there are likely to be some fully idle cycles remaining.
In the fine-grained multithreaded superscalar, the interleaving of threads mostly eliminates
fully empty slots. Because only a single thread issues instructions in a given clock cycle, however,
limitations in instruction-level parallelism still lead to idle slots within some clock cycles.
In the simultaneous multithreading case, thread-level parallelism and instruction-level
parallelism are both exploited, with multiple threads using the issue slots in a single clock cycle. Ideally,
the issue slot usage is limited by imbalances in the resource needs and resource availability over multiple
threads. In practice, other factors can restrict how many slots are used. Although the above diagram
greatly simplifies the real operation of these processors, it does illustrate the potential performance
advantages of multithreading in general and SMT in particular.
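The slot-utilization argument can be made concrete with a toy model (mine, with assumed numbers): on a 4-wide superscalar, a single thread with limited ILP cannot fill all the issue slots, whereas SMT fills the spare slots from the other threads in the same cycle:

```python
# Hypothetical sketch: issue-slot utilization on a 4-wide superscalar.
# Each thread can issue at most `ilp` instructions per cycle. Without SMT,
# only one thread issues per cycle; with SMT, leftover slots go to others.
WIDTH = 4

def cycles_needed(thread_ilp, instructions, smt):
    remaining = [instructions] * len(thread_ilp)
    cycles = 0
    while any(remaining):
        slots = WIDTH
        for t, ilp in enumerate(thread_ilp):
            if remaining[t] == 0:
                continue
            issue = min(ilp, remaining[t], slots)
            remaining[t] -= issue
            slots -= issue
            if not smt:
                break            # only one thread may issue this cycle
        cycles += 1
    return cycles

# Four threads, each with an ILP of 2 and 8 instructions to run.
print(cycles_needed([2, 2, 2, 2], 8, smt=False))
print(cycles_needed([2, 2, 2, 2], 8, smt=True))
```

With these assumed numbers, SMT halves the cycle count because two threads together fill all four slots every cycle, while a lone thread can only ever fill two.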
Conclusion
An alternative to sharing an address space is for each processor to have its own private address space. Such multiprocessors must communicate via explicit message passing, which traditionally gives this style of computer its name, provided the system has routines to send and receive messages. Coordination is built in with message passing, since one processor knows when a message is sent and the receiving processor knows when a message arrives. If the sender needs confirmation that the message has arrived, the receiving processor can send an acknowledgment message back to the sender.
Classic Organization of Multiprocessor with multiple private address space (or)
Message-Passing Multiprocessor – Diagram
Message passing
Message passing is nothing but communication between multiple processors by explicitly sending
and receiving information.
Send Message Routine
A routine used by a processor in machines with private memories to pass a message to another processor.
Receive Message Routine
A routine used by a processor in machines with private memories to accept a message from
another processor.
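The send/receive pattern can be sketched as follows (mine, not from the source); the two "nodes" are modeled as threads that keep their state in private local variables and communicate only through queues, which stand in for the interconnection network:

```python
# Illustrative sketch: send and receive message routines between two
# nodes that share no data, only a pair of message queues.
import threading, queue

to_node, to_host = queue.Queue(), queue.Queue()

def node():
    private_memory = []                  # not visible to the other side
    msg = to_node.get()                  # receive message routine (blocks)
    private_memory.append(msg)
    to_host.put(f"ack: {msg}")           # send an acknowledgment back

t = threading.Thread(target=node)
t.start()
to_node.put("hello")                     # send message routine
reply = to_host.get()                    # blocks until the ack arrives
t.join()
print(reply)
```

The blocking receive is what provides the built-in coordination the text mentions: the sender learns the message arrived exactly when the acknowledgment comes back.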
Some concurrent applications run well on parallel hardware, independent of whether it offers
shared addresses or message passing. In particular, job-level parallelism and applications with little
communication – like web search, mail servers, and file servers – do not require shared addressing to run
well.
Advantages
There were several attempts to build high-performance computers based on high-performance
message-passing networks, and they did offer better absolute communication performance than clusters
built using local area networks.
Disadvantages
The problem was that they were much more expensive. Few applications could justify the higher
communication performance, given the much higher costs.
Example - Clusters
Clusters are collections of computers that are connected to each other using their I/O interconnect
via standard network switches and cables to form a message-passing multiprocessor. Each runs a distinct
copy of the operating system. Virtually every internet service relies on clusters of commodity servers and
switches.
Drawbacks of cluster
o Administration cost – The cost of administering a cluster of n machines is about the same as
the cost of administering n independent machines, while the cost of administering a shared
memory multiprocessor with n processors is about the same as administering a single machine.
o Performance degradation – The processors in a cluster are usually connected using the I/O
interconnect of each computer; whereas the cores in a multiprocessor are usually connected on
the memory interconnect of the computer. The memory interconnect has higher bandwidth and
lower latency, allowing much better communication performance.
o Division of memory – A cluster of n machines has n independent memories and n copies of the
operating system, but a shared memory multiprocessor allows a single program to use almost
all the memory in the computer, and it only needs a single copy of the OS.
Advantages of Clusters
1. High availability – Since a cluster consists of independent computers connected through a local
area network, it is much easier to replace a machine without bringing down the system in cluster
than in an SMP.
2. Scalable – Given that clusters are constructed from whole computers and independent, scalable
networks, this isolation also makes it easier to expand the system without bringing down the
application that runs on top of the cluster.
3. Low cost
4. Improved power efficiency – Clusters consume less power and work efficiently.
Examples
The search engines that millions of us use every day depend upon this technology. eBay,
Google, Microsoft, Yahoo, and others all have multiple datacenters each with clusters of tens of
thousands of processors.