
LECTURE 2: Parallel Computing Platforms

(PART 1)

To accompany the text ``Introduction to Parallel Computing'', Ananth Grama, Anshul


Gupta, George Karypis, and Vipin Kumar
Addison Wesley, 2003.



Topic Overview
1. Implicit Parallelism: Trends in Microprocessor
Architectures
2. Limitations of Memory System Performance
3. Dichotomy of Parallel Computing Platforms
4. Communication Model of Parallel Platforms



Scope of Parallelism
 Conventional architectures broadly comprise a processor, a
memory system, and a datapath.
 Each of these components presents a significant performance
bottleneck. Parallelism addresses each of these components in
significant ways.
 Different applications utilize different aspects of parallelism, such
as:
 data-intensive applications utilize high aggregate throughput
(the number of instructions that can be executed in a unit of
time)
 server applications utilize high aggregate network bandwidth
 scientific applications typically utilize high processing and
memory system performance.



Implicit Parallelism: Trends in Microprocessor
Architectures
 Microprocessor clock speeds have posted impressive
gains over the past two decades.

 With current technologies, higher levels of device
integration put far more transistors at the disposal of the
microprocessor designer.

 Current processors therefore use these resources
(transistors, etc.) to execute multiple instructions in the
same cycle.



Machine Cycle
• The steps performed by the processor for each machine
language instruction it receives.
• The machine cycle is a four-step cycle (sketched in code below)
that includes:
• Fetch - Retrieve an instruction from memory.
• Decode - Translate the retrieved instruction into a series of
processor commands.
• Execute - Carry out those commands.
• Store - Write the results back to memory.
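A minimal sketch (not from the slides) of the four-step cycle as a C loop; the three-instruction mini-ISA, opcode names, and accumulator are illustrative assumptions only.

#include <stdio.h>

/* Hypothetical 3-instruction ISA, used only to illustrate the
   fetch-decode-execute-store cycle. */
enum opcode { LOAD, ADD, HALT };
struct instr { enum opcode op; int operand; };

int main(void) {
    struct instr program[] = { {LOAD, 5}, {ADD, 7}, {HALT, 0} };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        struct instr ir = program[pc++];        /* Fetch            */
        switch (ir.op) {                        /* Decode + Execute */
        case LOAD: acc = ir.operand; break;
        case ADD:  acc += ir.operand; break;
        case HALT: running = 0; break;
        }
        /* Store: a real CPU would write results back to registers or
           memory; here the accumulator plays that role. */
    }
    printf("result = %d\n", acc);               /* prints 12 */
    return 0;
}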



Pipelining and Superscalar Execution
• Processors have long relied on pipelining to improve execution
rates.
• An instruction pipeline is a technique used in the design
of computers to increase their instruction throughput (the
number of instructions that can be executed in a unit of time).
• The basic instruction cycle is broken up into a series of stages
called a pipeline.
• Rather than processing each instruction sequentially, each
instruction is split into a sequence of steps so that different
steps of different instructions can be executed concurrently.
• Pipelining thus increases instruction throughput by performing
multiple operations at the same time.



Pipelining and Superscalar
Execution
 Pipelining overlaps the various stages of instruction execution to
achieve performance.
 One instruction can be executed while the next is being
decoded and the one after that is being fetched.

 Analogy: an assembly line for the manufacture of cars.
 A task that takes 100 time units is broken into 10 pipelined stages
of 10 units each, enabling faster execution and higher throughput
(a small worked sketch follows below).

 Limitations of pipelining:
◦ The speed of a pipeline is eventually limited by its slowest stage.
◦ An unbalanced or stalled pipeline therefore becomes a bottleneck.
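A back-of-the-envelope sketch of the assembly-line numbers above, written as C arithmetic; the instruction count and the 30-unit slow stage are illustrative assumptions, not figures from the slides.

#include <stdio.h>

int main(void) {
    int    n           = 1000;    /* instructions (cars) to process          */
    double unpipelined = 100.0;   /* time units per instruction, no pipeline */
    int    stages      = 10;
    double stage_time  = 10.0;    /* balanced pipeline: 100 / 10 per stage   */

    /* Pipelined: the first result appears after 'stages' steps, then one
       result completes every step thereafter.                              */
    double t_seq  = n * unpipelined;
    double t_pipe = (stages + (n - 1)) * stage_time;
    printf("speedup, balanced stages   : %.2f\n", t_seq / t_pipe);

    /* If one stage takes 30 units, every step is paced by that stage. */
    double slow_step = 30.0;
    double t_slow    = (stages + (n - 1)) * slow_step;
    printf("speedup, one 30-unit stage : %.2f\n", t_seq / t_slow);
    return 0;
}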



Pipelining and Superscalar Execution

Example: the basic five-stage pipeline of a RISC machine (sketched below):


•IF = Instruction Fetch,
•ID = Instruction Decode,
•EX = Execute,
•MEM = Memory access,
•WB = Register write back
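A plain-text sketch of the staggered execution the original figure shows, with five instructions i1-i5 flowing through the five stages:

Clock cycle:   1    2    3    4    5    6    7    8    9
i1:            IF   ID   EX   MEM  WB
i2:                 IF   ID   EX   MEM  WB
i3:                      IF   ID   EX   MEM  WB
i4:                           IF   ID   EX   MEM  WB
i5:                                IF   ID   EX   MEM  WB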

In the fourth clock cycle, the earliest instruction (i1) is in its MEM
stage, while the latest instruction (i5) has not yet entered the pipeline.



Pipelining and Superscalar Execution
 One simple way of alleviating these bottlenecks is to use
multiple pipelines.
 Multiple pipelines:
◦ Improve instruction execution rate.
◦ Multiple instructions are piped into the processor in
parallel.
◦ Multiple instructions are executed on multiple
functional units.
 Superscalar execution: The ability of a processor to
issue multiple instructions in the same cycle.



Superscalar Execution: An Example

Superscalar execution: the ability of a processor to issue multiple
instructions in the same cycle.

Example of a two-way superscalar execution of instructions: by
fetching and dispatching two instructions at a time, a maximum of
two instructions per cycle can be completed.



Superscalar Execution: Discussion
 In the example, there is some wastage of resources
(e.g., adder utilization) due to data dependencies.
 The example also illustrates that different instruction
mixes with identical semantics can take significantly
different execution times (see the sketch below).
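A hedged illustration of that last point: both fragments below compute the same sum, but the second has a longer chain of true data dependencies, so a two-way superscalar core can overlap less of its work. The variable names and values are mine, not the slide's.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Mix A: the two inner adds are independent, so a processor with
       two adders can issue them in the same cycle.                    */
    int t1 = a + b;
    int t2 = c + d;
    int sumA = t1 + t2;            /* critical path: 2 add latencies   */

    /* Mix B: identical result, but each add depends on the previous
       one, so the adds issue one per cycle even with an idle adder.   */
    int sumB = a + b;
    sumB = sumB + c;
    sumB = sumB + d;               /* critical path: 3 add latencies   */

    printf("%d %d\n", sumA, sumB); /* both print 10                    */
    return 0;
}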



Superscalar Execution: Scheduling
 Scheduling of instructions is determined by several factors:
◦ True Data Dependency: the result of one operation is an input to
the next.
◦ Resource Dependency: two operations require the same
resource.
◦ Branch Dependency: scheduling instructions across conditional
branch statements cannot be done deterministically a priori.

 The scheduler (hardware) looks at a large number of instructions in
an instruction queue and, based on these factors, selects an
appropriate set of instructions to execute concurrently.
◦ The complexity of this hardware is an important constraint on
superscalar processors.



Superscalar Execution: Scheduling Issues
 In a simpler model, instructions can be issued only in
the order in which they are encountered. That is, if the
second instruction cannot be issued because it has a data
dependency with the first, only one instruction is issued
in the cycle. This is called in-order issue.
 In a more aggressive model, instructions can be issued
out of order. In this case, if the second instruction has
data dependencies with the first, but the third instruction
does not, the first and third instructions can be co-
scheduled. This is also called dynamic issue.
 Performance of in-order issue is generally limited.



Superscalar Execution: Efficiency Considerations
 Not all functional units can be kept busy at all times.
 Vertical waste: If during a cycle, no functional units are
utilized.
 Horizontal waste: If during a cycle, only some of the
functional units are utilized.
 Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to extract
parallelism, the performance of superscalar processors is
eventually limited.
 Conventional microprocessors typically support four-
way superscalar execution.



Very Long Instruction Word (VLIW) Processors
 The hardware cost and complexity of the
superscalar scheduler is a major
consideration in processor design.
 To address this issue, VLIW processors
rely on compile time analysis to identify
and bundle together instructions that can be
executed concurrently.
 These instructions are packed and
dispatched together, and thus the name
very long instruction word.
 This concept was used with some
commercial success in the Multiflow Trace
machine (circa 1984).
 Variants of this concept are employed in
the Intel IA64 processors.



Next



Limitations of
Memory System Performance
 Memory system, and not processor speed, is often the
bottleneck for many applications.
 Memory system performance is largely captured by two
parameters, latency and bandwidth.
 Latency is the time from the issue of a memory request
to the time the data is available at the processor.
 Bandwidth is the rate at which data can be pumped to
the processor by the memory system.



Memory System Performance: Bandwidth
and Latency
 It is very important to understand the difference between
latency and bandwidth.
◦ Consider the example of a fire-hose. If the water comes out of
the hose two seconds after the hydrant is turned on, the latency
of the system is two seconds.
◦ Once the water starts flowing, if the hydrant delivers water at the
rate of 5 gallons/second, the bandwidth of the system is 5
gallons/second.
◦ If you want immediate response from the hydrant, it is important
to reduce latency.
◦ If you want to fight big fires, you want high bandwidth.



Solution: Improve Effective Memory
Latency Using Caches
 Caches are small and fast memory elements between
the processor and DRAM.
 This memory acts as a low-latency high-bandwidth
storage.
◦ Condition: If a piece of data is repeatedly used, the effective
latency of this memory system can be reduced by the cache.

 The fraction of data references satisfied by the cache is
called the cache hit ratio of the computation on the
system.
◦ The cache hit ratio achieved by a code on a memory system often
determines its performance (a small numeric sketch follows).
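A minimal sketch of how the hit ratio sets the effective memory latency; the 1 ns cache and 100 ns DRAM access times are assumed for illustration, not figures from the slides.

#include <stdio.h>

int main(void) {
    double t_cache = 1.0;     /* ns, assumed cache access time */
    double t_dram  = 100.0;   /* ns, assumed DRAM access time  */

    /* Effective latency = hit * t_cache + (1 - hit) * t_dram. */
    for (double hit = 0.5; hit <= 1.0001; hit += 0.1) {
        double t_eff = hit * t_cache + (1.0 - hit) * t_dram;
        printf("hit ratio %.1f -> effective latency %6.1f ns\n", hit, t_eff);
    }
    return 0;
}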



Impact of Memory Bandwidth
Memory bandwidth is determined by the bandwidth
of the memory bus as well as the memory units.

• Memory bandwidth can be improved by increasing
the size of memory blocks.
• Analogy: delivery services with trucks of various sizes.
• The underlying system takes l time units (where l is
the latency of the system) to deliver b units of data
(where b is the block size); a small numeric sketch follows.
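A back-of-the-envelope sketch of the block-size effect, assuming a latency of 100 ns per request and 4-byte words; the numbers are illustrative only, and the gain is real only if the program actually uses the extra words in each block.

#include <stdio.h>

int main(void) {
    double latency_ns = 100.0;   /* assumed time l to service one request */
    double word_bytes = 4.0;     /* assumed word size                     */

    for (int block_words = 1; block_words <= 8; block_words *= 2) {
        /* One request for 'block_words' words still costs roughly one
           latency, so larger blocks raise the effective bandwidth.      */
        double bytes = block_words * word_bytes;
        double gb_s  = bytes / latency_ns;   /* bytes per ns == GB/s      */
        printf("block = %d word(s) -> %.2f GB/s\n", block_words, gb_s);
    }
    return 0;
}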



Alternate Approaches for
Hiding Memory Latency
 Consider the problem of browsing the web on a very slow network
connection. We deal with the problem in one of three possible
ways:

◦ Prefetching: We anticipate which pages we are going to browse
ahead of time and issue requests for them in advance;

◦ Multithreading: We open multiple browsers and access
different pages in each browser, so that while we are waiting for
one page to load, we can be reading others; or

◦ Spatial locality: We access a whole bunch of pages in one go,
reducing the latency across the various accesses (see the sketch
after this list).
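A minimal C sketch of the spatial-locality idea in the last bullet: touching memory in the order it is laid out lets each fetched cache line serve several consecutive accesses. The matrix size is an arbitrary assumption.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* C stores rows contiguously, so this traversal walks memory in order
   and reuses each fetched cache line for several consecutive elements. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* This traversal jumps N*8 bytes between accesses, so nearly every
   access touches a new cache line: same result, poor spatial locality. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}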



Tradeoffs of Multithreading and Prefetching
 Multithreading and prefetching are critically impacted
by the memory bandwidth.
 Multithreaded systems become bandwidth bound instead
of latency bound.

 Multithreading and prefetching:


◦ only address the latency problem and may often worsen the
bandwidth problem.
◦ require significantly more hardware resources in the form of
storage.



Next



Dichotomy of Parallel Computing Platforms
 Parallel platforms can be viewed in terms of their logical and
physical organization.

◦ Logical organization: the programmer's view of the
platform. An explicitly parallel program must specify:
 parallel/concurrent tasks (Control Structure).
 interactions between the concurrent subtasks (Communication
model).

◦ Physical organization: the actual hardware
organization of the platform.



Control Structure of Parallel Programs
 Parallelism can be expressed at various levels of granularity - from
instruction level to processes. Between these extremes exists a range of
models, along with corresponding architectural support.
 Processing units (PU/PE) in parallel computers either operate under the
centralized control of a single control unit or work independently.

 If there is a single control unit that dispatches the same instruction to
various processors (that work on different data), the model is referred to
as single instruction stream, multiple data stream (SIMD).

 If each processor has its own control unit, each processor can execute
different instructions on different data items. This model is called
multiple instruction stream, multiple data stream (MIMD).



SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).



SIMD Processors
 Some of the earliest parallel computers such as the
Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to
this class of machines.

 Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors and DSP
chips such as the Sharc.

 SIMD relies on the regular structure of computations (such
as those in image processing); a small sketch of such a
computation follows.
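A hedged example of the regular structure SIMD exploits: the same operation is applied to every pixel of an image, so a compiler targeting MMX/SSE-style units can process several elements per instruction. The image contents and brightness delta are illustrative assumptions.

#include <stdio.h>
#include <stddef.h>

/* Brighten an 8-bit grayscale image: every element receives the same
   instruction stream, the regular pattern SIMD units are built for.  */
static void brighten(unsigned char *pixels, size_t n, int delta) {
    for (size_t i = 0; i < n; i++) {
        int v = pixels[i] + delta;
        pixels[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

int main(void) {
    unsigned char img[8] = {0, 50, 100, 150, 200, 250, 255, 10};
    brighten(img, 8, 20);
    for (int i = 0; i < 8; i++)
        printf("%d ", img[i]);
    printf("\n");
    return 0;
}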



MIMD Processors
 In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
 A variant of this, called single program multiple data
streams (SPMD), executes the same program on different
processors (a minimal sketch follows below).
 It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and
underlying architectural support.
 Examples of such platforms include current generation
Sun Ultra Servers, SGI Origin Servers, multiprocessor
PCs, workstation clusters, and the IBM SP.
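A minimal SPMD sketch using MPI: every process runs the same program, and the rank decides what each process does. The coordinator/worker split and the messages printed are illustrative assumptions, not tied to the platforms listed above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same program everywhere (SPMD); behaviour branches on the rank. */
    if (rank == 0)
        printf("coordinator: %d processes in total\n", size);
    else
        printf("worker %d: doing my share of the work\n", rank);

    MPI_Finalize();
    return 0;
}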



Flynn's Classical Taxonomy
• There are different ways to classify parallel computers.

• One of the more widely used classifications, in use since
1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor
computer architectures according to how they can be
classified along the two independent dimensions
of Instruction Stream and Data Stream. Each of these
dimensions can have only one of two possible
states: Single or Multiple.




Flynn's Taxonomy

The matrix below defines the 4 possible classifications:
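                            Single Data (SD)   Multiple Data (MD)
Single Instruction (SI)     SISD               SIMD
Multiple Instruction (MI)   MISD               MIMD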




a) Single Instruction, Single Data (SISD)

 A serial (non-parallel) computer


 Single Instruction: Only one instruction stream is being acted on by
the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input during any
one clock cycle
 Deterministic execution

 Examples: older generation mainframes, minicomputers,
workstations, and single-processor/core PCs.




b) Single Instruction, Multiple Data (SIMD)
 A type of parallel computer
 Single Instruction: All processing units execute the same instruction
at any given clock cycle
 Multiple Data: Each processing unit can operate on a different data
element
 Best suited for specialized problems characterized by a high degree
of regularity, such as graphics/image processing.
 Synchronous (lockstep) and deterministic execution
 Two varieties: Processor Arrays and Vector Pipelines




c) Multiple Instructions, Single Data (MISD)
 A type of parallel computer
 Multiple Instructions: Each processing unit operates on the data
independently via separate instruction streams.
 Single Data: A single data stream is fed into multiple processing
units.
 Few (if any) actual examples of this class of parallel computer have
ever existed.
 Some conceivable uses might be:
◦ multiple frequency filters operating on a single signal stream
◦ multiple cryptography algorithms attempting to crack a single coded
message.




d) Multiple Instructions, Multiple Data (MIMD)
 A type of parallel computer
 Multiple Instruction: Every processor may be executing a different
instruction stream
 Multiple Data: Every processor may be working with a different data
stream
 Execution can be synchronous or asynchronous, deterministic or
non-deterministic
 Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
 Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
 Note: many MIMD architectures also include SIMD execution sub-
components




SIMD-MIMD Comparison
 SIMD computers require less hardware than MIMD
computers (single control unit).
 However, since SIMD processors are specially designed,
they tend to be expensive and to have long design cycles.
 Not all applications are naturally suited to SIMD
processors.
 In contrast, platforms supporting the SPMD paradigm
can be built from inexpensive off-the-shelf components
with relatively little effort in a short amount of time.



Next



Communication Model of Parallel
Platforms
 There are two primary forms of data exchange between
parallel tasks:
◦ accessing a shared data space
◦ exchanging messages

 Platforms that provide a shared data space are called
shared-address-space machines or multiprocessors.

 Platforms that support messaging are also called message-
passing platforms or multicomputers.



Communication Model: Examples




Shared-Address-Space Platforms
 Part (or all) of the memory is accessible to all processors.
 Processors interact by modifying data objects stored in
this shared-address-space.
 The platform is classified as:
◦ Uniform Memory Access (UMA)
 Processors have equal access time
to memory in the system.
 Identical processors
◦ Non-Uniform Memory Access (NUMA)
 Not all processors have equal access time to all memories.
 Memory access across the interconnection link is slower.



Shared-Address-Space Platforms: Examples

Typical shared-address-space architectures:
(a) Uniform-memory-access shared-address-space computer;
(b) Non-uniform-memory-access shared-address-space computer
with local memory only.



NUMA and UMA Shared-Address-Space Platforms
 The distinction between NUMA and UMA platforms is
important from the point of view of algorithm design.
NUMA machines require locality from underlying
algorithms for performance.
 Programming these platforms is easier since reads and writes
are implicitly visible to other processors.
 However, read and write accesses to shared data must be coordinated
(this will be discussed in greater detail when we talk about
threads programming; a small sketch follows this slide).
 Caches in such machines require coordinated access to
multiple copies. This leads to the cache coherence problem.
 A weaker model of these machines provides an address map,
but not coordinated access. These models are called non-cache-
coherent shared-address-space machines.
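A small pthreads sketch of the coordination point above: both threads see the shared counter through the shared address space, and a mutex keeps the read-modify-write updates from interleaving. The thread and iteration counts are arbitrary assumptions.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                        /* shared data object    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);              /* coordinate the update */
        counter++;                              /* read-modify-write     */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter);         /* 400000 with the mutex */
    return 0;
}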



Message-Passing Platforms
 These platforms comprise a set of processors, each with its own
(exclusive) memory.
 These platforms are programmed using (variants of) send and
receive primitives; libraries such as MPI and PVM provide
such primitives (a minimal sketch follows below).

 A Beowulf cluster is a multi-computer architecture which can be
used for parallel computations.

 It is a system which usually consists of one server node and one
or more client nodes connected via Ethernet or some other
network.
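A minimal send/receive sketch in MPI, the kind of primitive pair the slide refers to; the message tag and contents are arbitrary, and the program assumes it is launched with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                     /* data is copied, not shared */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}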



Message Passing
vs.
Shared Address Space Platforms
 Message passing:
◦ requires little hardware support, other than a network.
◦ processors must explicitly communicate with each other
through messages.
◦ data exchanged among processors cannot be shared; it is
copied (using send/receive messages).

 Shared address space platforms:
◦ require more hardware support.
◦ processors access memory through the shared bus.
◦ data sharing between tasks is fast and uniform.



Next Week:

Lecture 3:
Parallel Platforms (Part 2)

