
LECTURE 2: Parallel Computing Platforms

(PART 1)

To accompany the text ``Introduction to Parallel Computing'', Ananth Grama, Anshul


Gupta, George Karypis, and Vipin Kumar
Addison Wesley, 2003.



Topic Overview
1. Implicit Parallelism: Trends in Microprocessor
Architectures
2. Limitations of Memory System Performance
3. Dichotomy of Parallel Computing Platforms
4. Communication Model of Parallel Platforms



Scope of Parallelism
 Conventional architectures broadly comprise a processor, a
memory system, and a datapath.
 Each of these components presents a significant performance
bottleneck. Parallelism addresses each of these components in
significant ways.
 Different applications utilize different aspects of parallelism, such
as:
 data-intensive applications utilize high aggregate throughput
(the number of instructions that can be executed in a unit of
time)
 server applications utilize high aggregate network bandwidth
 scientific applications typically utilize high processing and
memory system performance.



Implicit Parallelism: Trends in Microprocessor
Architectures
 Microprocessor clock speeds have posted impressive
gains over the past two decades.

 With current technologies, higher levels of device
integration put far more transistors at the disposal of the
microprocessor designer.

 Current processors therefore use these resources
(transistors, etc.) to execute multiple instructions in the
same cycle.



Machine Cycle
• The steps performed by the processor for each machine
language instruction it receives.
• The machine cycle is a four-step cycle (sketched in code below)
that includes:
• Fetch - Retrieve an instruction from memory.
• Decode - Translate the retrieved instruction into a series of
processor commands.
• Execute - Carry out those commands.
• Store - Write the results back to memory.
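A minimal sketch (not from the slides) of the four-step cycle as a C loop; the three-instruction mini-ISA, opcode names, and accumulator are illustrative assumptions only.

#include <stdio.h>

/* Hypothetical 3-instruction ISA, used only to illustrate the
   fetch-decode-execute-store cycle. */
enum opcode { LOAD, ADD, HALT };
struct instr { enum opcode op; int operand; };

int main(void) {
    struct instr program[] = { {LOAD, 5}, {ADD, 7}, {HALT, 0} };
    int pc = 0, acc = 0, running = 1;

    while (running) {
        struct instr ir = program[pc++];        /* Fetch            */
        switch (ir.op) {                        /* Decode + Execute */
        case LOAD: acc = ir.operand; break;
        case ADD:  acc += ir.operand; break;
        case HALT: running = 0; break;
        }
        /* Store: a real CPU would write results back to registers or
           memory; here the accumulator plays that role. */
    }
    printf("result = %d\n", acc);               /* prints 12 */
    return 0;
}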



Pipelining and Superscalar Execution
• Processors have long relied on pipelining to improve execution
rates.
• An instruction pipeline is a technique used in the design
of computers to increase their instruction throughput (the
number of instructions that can be executed in a unit of time).
• The basic instruction cycle is broken up into a series of stages
called a pipeline.
• Rather than processing each instruction sequentially, each
instruction is split into a sequence of steps so that different
steps of different instructions can be executed concurrently.
• Pipelining thus increases instruction throughput by performing
multiple operations at the same time.



Pipelining and Superscalar
Execution
 Pipelining overlaps the various stages of instruction execution to
achieve performance.
 One instruction can be executed while the next is being
decoded and the one after that is being fetched.

 Analogy: an assembly line for the manufacture of cars.
 A task that takes 100 time units is broken into 10 pipelined stages
of 10 units each, enabling faster execution and higher throughput
(a small worked sketch follows below).

 Limitations of pipelining:
◦ The speed of a pipeline is eventually limited by its slowest stage.
◦ An unbalanced or stalled pipeline therefore becomes a bottleneck.
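A back-of-the-envelope sketch of the assembly-line numbers above, written as C arithmetic; the instruction count and the 30-unit slow stage are illustrative assumptions, not figures from the slides.

#include <stdio.h>

int main(void) {
    int    n           = 1000;    /* instructions (cars) to process          */
    double unpipelined = 100.0;   /* time units per instruction, no pipeline */
    int    stages      = 10;
    double stage_time  = 10.0;    /* balanced pipeline: 100 / 10 per stage   */

    /* Pipelined: the first result appears after 'stages' steps, then one
       result completes every step thereafter.                              */
    double t_seq  = n * unpipelined;
    double t_pipe = (stages + (n - 1)) * stage_time;
    printf("speedup, balanced stages   : %.2f\n", t_seq / t_pipe);

    /* If one stage takes 30 units, every step is paced by that stage. */
    double slow_step = 30.0;
    double t_slow    = (stages + (n - 1)) * slow_step;
    printf("speedup, one 30-unit stage : %.2f\n", t_seq / t_slow);
    return 0;
}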



Pipelining and Superscalar Execution

Example: the basic five-stage pipeline of a RISC machine (sketched below):


•IF = Instruction Fetch,
•ID = Instruction Decode,
•EX = Execute,
•MEM = Memory access,
•WB = Register write back
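A plain-text sketch of the staggered execution the original figure shows, with five instructions i1-i5 flowing through the five stages:

Clock cycle:   1    2    3    4    5    6    7    8    9
i1:            IF   ID   EX   MEM  WB
i2:                 IF   ID   EX   MEM  WB
i3:                      IF   ID   EX   MEM  WB
i4:                           IF   ID   EX   MEM  WB
i5:                                IF   ID   EX   MEM  WB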

In the fourth clock cycle, the earliest instruction (i1) is in its MEM
stage, while the latest instruction (i5) has not yet entered the pipeline.



Pipelining and Superscalar Execution
 One simple way of alleviating these bottlenecks is to use
multiple pipelines.
 Multiple pipelines:
◦ Improve instruction execution rate.
◦ Multiple instructions are piped into the processor in
parallel.
◦ Multiple instructions are executed on multiple
functional units.
 Superscalar execution: The ability of a processor to
issue multiple instructions in the same cycle.



Superscalar Execution: An Example

Superscalar execution: the ability of a processor to issue multiple
instructions in the same cycle.

Example of a two-way superscalar execution of instructions: by
fetching and dispatching two instructions at a time, a maximum of
two instructions per cycle can be completed.



Superscalar Execution: Discussion
 In the example, there is some wastage of resources
(e.g., adder utilization) due to data dependencies.
 The example also illustrates that different instruction
mixes with identical semantics can take significantly
different execution times (see the sketch below).
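A hedged illustration of that last point: both fragments below compute the same sum, but the second has a longer chain of true data dependencies, so a two-way superscalar core can overlap less of its work. The variable names and values are mine, not the slide's.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* Mix A: the two inner adds are independent, so a processor with
       two adders can issue them in the same cycle.                    */
    int t1 = a + b;
    int t2 = c + d;
    int sumA = t1 + t2;            /* critical path: 2 add latencies   */

    /* Mix B: identical result, but each add depends on the previous
       one, so the adds issue one per cycle even with an idle adder.   */
    int sumB = a + b;
    sumB = sumB + c;
    sumB = sumB + d;               /* critical path: 3 add latencies   */

    printf("%d %d\n", sumA, sumB); /* both print 10                    */
    return 0;
}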



Superscalar Execution: Scheduling
 Scheduling of instructions is determined by several factors:
◦ True Data Dependency: the result of one operation is an input to
the next.
◦ Resource Dependency: two operations require the same
resource.
◦ Branch Dependency: scheduling instructions across conditional
branch statements cannot be done deterministically a priori.

 The scheduler (hardware) looks at a large number of instructions in
an instruction queue and, based on these factors, selects an
appropriate set of instructions to execute concurrently.
◦ The complexity of this hardware is an important constraint on
superscalar processors.



Superscalar Execution: Scheduling Issues
 In a simpler model, instructions can be issued only in
the order in which they are encountered. That is, if the
second instruction cannot be issued because it has a data
dependency with the first, only one instruction is issued
in the cycle. This is called in-order issue.
 In a more aggressive model, instructions can be issued
out of order. In this case, if the second instruction has
data dependencies with the first, but the third instruction
does not, the first and third instructions can be co-
scheduled. This is also called dynamic issue.
 Performance of in-order issue is generally limited.



Superscalar Execution: Efficiency Considerations
 Not all functional units can be kept busy at all times.
 Vertical waste: If during a cycle, no functional units are
utilized.
 Horizontal waste: If during a cycle, only some of the
functional units are utilized.
 Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to extract
parallelism, the performance of superscalar processors is
eventually limited.
 Conventional microprocessors typically support four-
way superscalar execution.



Very Long Instruction Word (VLIW) Processors
 The hardware cost and complexity of the
superscalar scheduler is a major
consideration in processor design.
 To address this issue, VLIW processors
rely on compile time analysis to identify
and bundle together instructions that can be
executed concurrently.
 These instructions are packed and
dispatched together, and thus the name
very long instruction word.
 This concept was used with some
commercial success in the Multiflow Trace
machine (circa 1984).
 Variants of this concept are employed in
the Intel IA64 processors.



Next



Limitations of
Memory System Performance
 Memory system, and not processor speed, is often the
bottleneck for many applications.
 Memory system performance is largely captured by two
parameters, latency and bandwidth.
 Latency is the time from the issue of a memory request
to the time the data is available at the processor.
 Bandwidth is the rate at which data can be pumped to
the processor by the memory system.



Memory System Performance: Bandwidth
and Latency
 It is very important to understand the difference between
latency and bandwidth.
◦ Consider the example of a fire-hose. If the water comes out of
the hose two seconds after the hydrant is turned on, the latency
of the system is two seconds.
◦ Once the water starts flowing, if the hydrant delivers water at the
rate of 5 gallons/second, the bandwidth of the system is 5
gallons/second.
◦ If you want immediate response from the hydrant, it is important
to reduce latency.
◦ If you want to fight big fires, you want high bandwidth.



Solution: Improve Effective Memory
Latency Using Caches
 Caches are small and fast memory elements between
the processor and DRAM.
 This memory acts as a low-latency high-bandwidth
storage.
◦ Condition: If a piece of data is repeatedly used, the effective
latency of this memory system can be reduced by the cache.

 The fraction of data references satisfied by the cache is
called the cache hit ratio of the computation on the
system.
◦ The cache hit ratio achieved by a code on a memory system often
determines its performance (a small numeric sketch follows).
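A minimal sketch of how the hit ratio sets the effective memory latency; the 1 ns cache and 100 ns DRAM access times are assumed for illustration, not figures from the slides.

#include <stdio.h>

int main(void) {
    double t_cache = 1.0;     /* ns, assumed cache access time */
    double t_dram  = 100.0;   /* ns, assumed DRAM access time  */

    /* Effective latency = hit * t_cache + (1 - hit) * t_dram. */
    for (double hit = 0.5; hit <= 1.0001; hit += 0.1) {
        double t_eff = hit * t_cache + (1.0 - hit) * t_dram;
        printf("hit ratio %.1f -> effective latency %6.1f ns\n", hit, t_eff);
    }
    return 0;
}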



Impact of Memory Bandwidth
Memory bandwidth is determined by the bandwidth
of the memory bus as well as the memory units.

• Memory bandwidth can be improved by increasing
the size of memory blocks.
• Analogy: delivery services with trucks of various sizes.
• The underlying system takes l time units (where l is
the latency of the system) to deliver b units of data
(where b is the block size); a small numeric sketch follows.
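A back-of-the-envelope sketch of the block-size effect, assuming a latency of 100 ns per request and 4-byte words; the numbers are illustrative only, and the gain is real only if the program actually uses the extra words in each block.

#include <stdio.h>

int main(void) {
    double latency_ns = 100.0;   /* assumed time l to service one request */
    double word_bytes = 4.0;     /* assumed word size                     */

    for (int block_words = 1; block_words <= 8; block_words *= 2) {
        /* One request for 'block_words' words still costs roughly one
           latency, so larger blocks raise the effective bandwidth.      */
        double bytes = block_words * word_bytes;
        double gb_s  = bytes / latency_ns;   /* bytes per ns == GB/s      */
        printf("block = %d word(s) -> %.2f GB/s\n", block_words, gb_s);
    }
    return 0;
}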



Alternate Approaches for
Hiding Memory Latency
 Consider the problem of browsing the web on a very slow network
connection. We deal with the problem in one of three possible
ways:

◦ Prefetching: We anticipate which pages we are going to browse
ahead of time and issue requests for them in advance;

◦ Multithreading: We open multiple browsers and access
different pages in each browser, so that while we are waiting for
one page to load, we can be reading others; or

◦ Spatial locality: We access a whole bunch of pages in one go,
reducing the latency across the various accesses (see the sketch
after this list).
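A minimal C sketch of the spatial-locality idea in the last bullet: touching memory in the order it is laid out lets each fetched cache line serve several consecutive accesses. The matrix size is an arbitrary assumption.

#include <stdio.h>

#define N 1024
static double a[N][N];

/* C stores rows contiguously, so this traversal walks memory in order
   and reuses each fetched cache line for several consecutive elements. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* This traversal jumps N*8 bytes between accesses, so nearly every
   access touches a new cache line: same result, poor spatial locality. */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}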



Tradeoffs of Multithreading and Prefetching
 Multithreading and prefetching are critically impacted
by the memory bandwidth.
 Multithreaded systems become bandwidth bound instead
of latency bound.

 Multithreading and prefetching:


◦ only address the latency problem and may often worsen the
bandwidth problem.
◦ require significantly more hardware resources in the form of
storage.



Next



Dichotomy of Parallel Computing Platforms
 Parallel platforms can be viewed in terms of their logical and
physical organization.

◦ Logical organization: the programmer's view of the
platform. An explicitly parallel program must specify:
 parallel/concurrent tasks (Control Structure).
 interactions between the concurrent subtasks (Communication
model).

◦ Physical organization: the actual hardware
organization of the platform.



Control Structure of Parallel Programs
 Parallelism can be expressed at various levels of granularity - from
instruction level to processes. Between these extremes exists a range of
models, along with corresponding architectural support.
 Processing units (PU/PE) in parallel computers either operate under the
centralized control of a single control unit or work independently.

 If there is a single control unit that dispatches the same instruction to
various processors (that work on different data), the model is referred to
as single instruction stream, multiple data stream (SIMD).

 If each processor has its own control unit, each processor can execute
different instructions on different data items. This model is called
multiple instruction stream, multiple data stream (MIMD).



SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).



SIMD Processors
 Some of the earliest parallel computers such as the
Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to
this class of machines.

 Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors and DSP
chips such as the Sharc.

 SIMD relies on the regular structure of computations (such
as those in image processing); a small sketch of such a
computation follows.
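A hedged example of the regular structure SIMD exploits: the same operation is applied to every pixel of an image, so a compiler targeting MMX/SSE-style units can process several elements per instruction. The image contents and brightness delta are illustrative assumptions.

#include <stdio.h>
#include <stddef.h>

/* Brighten an 8-bit grayscale image: every element receives the same
   instruction stream, the regular pattern SIMD units are built for.  */
static void brighten(unsigned char *pixels, size_t n, int delta) {
    for (size_t i = 0; i < n; i++) {
        int v = pixels[i] + delta;
        pixels[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}

int main(void) {
    unsigned char img[8] = {0, 50, 100, 150, 200, 250, 255, 10};
    brighten(img, 8, 20);
    for (int i = 0; i < 8; i++)
        printf("%d ", img[i]);
    printf("\n");
    return 0;
}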



MIMD Processors
 In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
 A variant of this, called single program multiple data
streams (SPMD), executes the same program on different
processors (a minimal sketch follows below).
 It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and
underlying architectural support.
 Examples of such platforms include current generation
Sun Ultra Servers, SGI Origin Servers, multiprocessor
PCs, workstation clusters, and the IBM SP.
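A minimal SPMD sketch using MPI: every process runs the same program, and the rank decides what each process does. The coordinator/worker split and the messages printed are illustrative assumptions, not tied to the platforms listed above.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same program everywhere (SPMD); behaviour branches on the rank. */
    if (rank == 0)
        printf("coordinator: %d processes in total\n", size);
    else
        printf("worker %d: doing my share of the work\n", rank);

    MPI_Finalize();
    return 0;
}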



Flynn's Classical Taxonomy
• There are different ways to classify parallel computers.

• One of the more widely used classifications, in use since
1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor
computer architectures according to how they can be
classified along the two independent dimensions
of Instruction Stream and Data Stream. Each of these
dimensions can have only one of two possible
states: Single or Multiple.




Flynn's Taxonomy

The matrix below defines the 4 possible classifications:
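                            Single Data (SD)   Multiple Data (MD)
Single Instruction (SI)     SISD               SIMD
Multiple Instruction (MI)   MISD               MIMD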




a) Single Instruction, Single Data (SISD)

 A serial (non-parallel) computer


 Single Instruction: Only one instruction stream is being acted on by
the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input during any
one clock cycle
 Deterministic execution

 Examples: older generation mainframes, minicomputers,
workstations, and single-processor/core PCs.




b) Single Instruction, Multiple Data (SIMD)
 A type of parallel computer
 Single Instruction: All processing units execute the same instruction
at any given clock cycle
 Multiple Data: Each processing unit can operate on a different data
element
 Best suited for specialized problems characterized by a high degree
of regularity, such as graphics/image processing.
 Synchronous (lockstep) and deterministic execution
 Two varieties: Processor Arrays and Vector Pipelines




c) Multiple Instructions, Single Data (MISD)
 A type of parallel computer
 Multiple Instructions: Each processing unit operates on the data
independently via separate instruction streams.
 Single Data: A single data stream is fed into multiple processing
units.
 Few (if any) actual examples of this class of parallel computer have
ever existed.
 Some conceivable uses might be:
◦ multiple frequency filters operating on a single signal stream
◦ multiple cryptography algorithms attempting to crack a single coded
message.




d) Multiple Instructions, Multiple Data (MIMD)
 A type of parallel computer
 Multiple Instruction: Every processor may be executing a different
instruction stream
 Multiple Data: Every processor may be working with a different data
stream
 Execution can be synchronous or asynchronous, deterministic or
non-deterministic
 Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
 Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
 Note: many MIMD architectures also include SIMD execution sub-
components




SIMD-MIMD Comparison
 SIMD computers require less hardware than MIMD
computers (single control unit).
 However, since SIMD processors are specially designed,
they tend to be expensive and to have long design cycles.
 Not all applications are naturally suited to SIMD
processors.
 In contrast, platforms supporting the SPMD paradigm
can be built from inexpensive off-the-shelf components
with relatively little effort in a short amount of time.



Next



Communication Model of Parallel
Platforms
 There are two primary forms of data exchange between
parallel tasks:
◦ accessing a shared data space
◦ exchanging messages

 Platforms that provide a shared data space are called
shared-address-space machines or multiprocessors.

 Platforms that support messaging are also called message-
passing platforms or multicomputers.



Communication Model: Examples




Shared-Address-Space Platforms
 Part (or all) of the memory is accessible to all processors.
 Processors interact by modifying data objects stored in
this shared-address-space.
 The platform is classified as:
◦ Uniform Memory Access (UMA)
 Processors have equal access time
to memory in the system.
 Identical processors
◦ Non-Uniform Memory Access (NUMA)
 Not all processors have equal access time to all memories.
 Memory access across the interconnection link is slower.



Shared-Address-Space Platforms: Examples

Typical shared-address-space architectures:
(a) Uniform-memory-access shared-address-space computer;
(b) Non-uniform-memory-access shared-address-space computer
with local memory only.



NUMA and UMA Shared-Address-Space Platforms
 The distinction between NUMA and UMA platforms is
important from the point of view of algorithm design.
NUMA machines require locality from underlying
algorithms for performance.
 Programming these platforms is easier since reads and writes
are implicitly visible to other processors.
 However, read and write accesses to shared data must be coordinated
(this will be discussed in greater detail when we talk about
threads programming; a small sketch follows this slide).
 Caches in such machines require coordinated access to
multiple copies. This leads to the cache coherence problem.
 A weaker model of these machines provides an address map,
but not coordinated access. These models are called non-cache-
coherent shared-address-space machines.
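A small pthreads sketch of the coordination point above: both threads see the shared counter through the shared address space, and a mutex keeps the read-modify-write updates from interleaving. The thread and iteration counts are arbitrary assumptions.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                        /* shared data object    */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);              /* coordinate the update */
        counter++;                              /* read-modify-write     */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("counter = %ld\n", counter);         /* 400000 with the mutex */
    return 0;
}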



Message-Passing Platforms
 These platforms comprise a set of processors, each with its own
(exclusive) memory.
 These platforms are programmed using (variants of) send and
receive primitives; libraries such as MPI and PVM provide
such primitives (a minimal sketch follows below).

 A Beowulf cluster is a multi-computer architecture which can be
used for parallel computations.

 It is a system which usually consists of one server node and one
or more client nodes connected via Ethernet or some other
network.
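A minimal send/receive sketch in MPI, the kind of primitive pair the slide refers to; the message tag and contents are arbitrary, and the program assumes it is launched with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                     /* data is copied, not shared */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}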



Message Passing
vs.
Shared Address Space Platforms
 Message passing:
◦ requires little hardware support, other than a network.
◦ processors must explicitly communicate with each other
through messages.
◦ data exchanged among processors cannot be shared; it is
copied (using send/receive messages).

 Shared address space platforms:
◦ require more hardware support.
◦ processors access memory through the shared bus.
◦ data sharing between tasks is fast and uniform.



Next Week:

Lecture 3:
Parallel Platforms (Part 2)

