
CSE4001

Parallel and Distributed Computing


Dr. N. Narayanan Prasanth
Associate Professor
Department of Database Systems
School of Computer Science and Engineering (SCOPE)
VIT University Vellore
Mail: n.Prasanth@vit.ac.in
Cabin: SJT116 A40
Objectives
1. To introduce the fundamentals of parallel and distributed computing including
parallel and distributed architectures and paradigms
2. To understand the technologies, system architecture, and communication
architecture that propelled the growth of parallel and distributed computing
systems
3. To develop and execute basic parallel and distributed applications using basic
programming models and tools.
Course Outcomes
1. Identify and describe the physical organization of parallel platforms and remember the cost metrics that can be used for
algorithm design.
2. Demonstrate knowledge of the core architectural aspects of Parallel Computing.
3. Indicate key factors that contribute to efficient parallel algorithms and apply a suite of techniques that can be applied
across a wide range of applications.
4. Distinguish the architectural and fundamental models that underpin distributed systems and identify challenges in the
construction of distributed systems.
5. Analyze the issues in synchronizing clocks in distributed systems, apply suitable algorithms to solve them effectively, and
examine how election algorithms can be implemented in a distributed system.
6. Discuss how locking, timestamp ordering and optimistic concurrency control may be extended for use with distributed
transactions.
7. Apply the concepts behind software architectures for large scale Web based systems as well as to design, recognize and
evaluate software architectures.
8. Design and build application programs on distributed Systems.
9. Analyze and implement ideas from current distributed computing research literature.
SYLLABUS

Module 1: Parallelism Fundamentals


Motivation – Key Concepts and Challenges – Overview of Parallel computing – Flynn’s Taxonomy – Multi-Core
Processors – Shared vs Distributed memory.

Module 2: Parallel Architectures


Introduction to OpenMP Programming – Instruction Level Support for Parallel Programming – SIMD – Vector
Processing – GPUs

Module 3: Parallel Algorithm and Design


Preliminaries – Decomposition Techniques – Characteristics of Tasks and Interactions – Mapping Techniques for
Load balancing – Parallel Algorithm Design

Module 4: Introduction to Distributed Systems


Introduction – Characterization of Distributed Systems – Distributed Shared Memory – Message Passing –
Programming Using the Message Passing Paradigm – Group Communication – Case Study (RPC and Java RMI).

Module 5: Coordination
Time and Global States – Synchronizing Physical Clocks – Logical Time and Logical Clock – Coordination and
Agreement – Distributed Mutual Exclusion – Election Algorithms – Consensus and Related Problems
Module 6: Distributed Transactions
Transactions and Concurrency Control – Nested Transactions – Locks – Optimistic Concurrency Control –
Timestamp Ordering – Distributed Transactions: Flat and Nested – Atomic Commit – Two-Phase Commit Protocol –
Concurrency Control
Module 7: Distributed System Architecture & its Variants
Distributed File System: Architecture – Processes – Communication. Distributed Web-based System:
Architecture – Processes – Communication. Overview of Distributed Computing Platforms.
Module 8: Recent Trends

Text Books
1. Grama, A. (2013). Introduction to parallel computing. Harlow, England: Addison-Wesley.
2. Coulouris, G. (2012). Distributed systems. Boston: Addison-Wesley.
Reference Books
1. Hwang, K., & Briggs, F. (1990). Computer architecture and parallel processing (1st ed.). New York: McGraw-Hill.
2. Sinha, P. K. (2007). Distributed operating systems: Concepts and design. PHI Learning Pvt. Ltd.
3. Hager, G. (2017). Introduction to high performance computing for scientists and engineers. CRC Press.
4. Tanenbaum, A. (2007). Distributed systems: Principles and paradigms (2nd ed.). Pearson.
“Parallelism is the future of computing”
Motivation

• The Computational Power Argument – from Transistors to FLOPS


Gordon Moore formulated Moore's law:
“It is the observation that the number of transistors in a dense integrated circuit doubles
about every two years.”

• The Memory/Disk Speed Argument


• The growing gap between processor speed and memory/disk speed presents a tremendous performance bottleneck

• The Data Communication Argument


Parallel Computing
Parallel computing is the use of two or more processors in combination to solve a single
problem.
It targets problems that require large amounts of computation, memory, or both.

The compute resources may include:

• A single computer with multiple processors


• A single computer with multiple processors and some specialized compute resources like GPU,
FPGA, etc.
• An arbitrary number of computers connected by a network
• A combination of the above
Overview: Serial vs Parallel Computing
Limitations in Serial Computing

• Transmission speed

• Limits to miniaturization – processor technology eventually reaches physical limits on how small components can be made

• Economic limitations – expensive to make single processor faster

• Building ever-faster serial computers is increasingly impractical


Parallel computing is the simultaneous use of multiple compute resources to solve a
computational problem:
Parallel Computers
• A parallel computer is a set of processors that are able to work cooperatively to solve a
computational problem.
• Applications such as video conferencing, collaborative work environments, computer-aided
diagnosis in medicine, and virtual reality require a computer able to process large amounts of
data quickly.
• Networks connect multiple stand-alone computers (nodes) to make larger parallel computer
clusters.

Fig. Compute Clusters


Fig. IBM BG/Q compute chip with 18 cores (PU) and 16 L2 cache units (L2)
Example: Typical Lawrence Livermore National Laboratory (LLNL) parallel
computer cluster

• Each compute node is a multi-processor parallel computer in itself


• Multiple compute nodes are networked together with an InfiniBand network
Need for Parallel Computing

The primary reasons for using parallel computing:

Solve larger problems


Save time
Provide concurrency
Effective utilization of non-local resources
Cost savings
Overcoming memory constraints
Scope of Parallel Programming

Parallel computing has made a tremendous impact on a variety of areas ranging from computational
simulations for scientific and engineering applications to commercial applications in data mining and
transaction processing.

Applications in Engineering and Design


design of airfoils, internal combustion engines, high-speed circuits, etc.

Scientific Applications
Functional and structural characterization of genes and proteins, sequencing of the human genome,
analysing biological sequences with a view to developing new drugs, etc.

Commercial Applications
cost-effective servers, Linux clusters, etc.

Applications in Computer Systems


Cryptography, distributed control algorithms for embedded systems, etc.
Challenges

1. Architectural changes (single-core to multi-core to many-core)


2. Need for optimal algorithms
3. Seeking Concurrency
4. Achieving parallel speedup
5. Granularity of partitioning among processors (proper scheduling)
Communication cost and load-balancing cost
6. Locality of computation and communication
Communication between processors, or between processors and memory
7. Resource allocation
8. Data access, communication and synchronization
9. Performance and Scalability
10. Error recovery and support for fault tolerance
Architectural Trends

• Architecture translates technology’s gifts to performance and capability


• Resolves the tradeoff between parallelism and locality
• Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances

• Understanding microprocessor architectural trends


• Helps build intuition about design issues of parallel machines
• Shows fundamental role of parallelism even in “sequential” computers

• Four generations of architectural history: Vacuum tube, transistor, IC, VLSI


Architectural Trends

• Greatest trend in VLSI generation is increase in parallelism


• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
– slows after 32-bit
– adoption of 64-bit almost complete, 128-bit far off (not a performance issue)
– great inflection point when a 32-bit micro and cache fit on a chip
• Mid 80s to mid 90s: instruction level parallelism
– pipelining and simple instruction sets, + compiler advances (RISC)
– on-chip caches and functional units => superscalar execution
– greater sophistication: out of order execution, speculation, prediction
» to deal with control transfer and latency problems
• thread level parallelism - Multithreading
Concepts
Flynn's Classical Taxonomy - different ways to classify parallel computers
• distinguishes multi-processor computer architectures according to how they can be classified
along the two independent dimensions of Instruction Stream and Data Stream.
• Each of these dimensions can have only one of two possible states: Single or Multiple
Concepts and Terminology
Single Instruction, Single Data (SISD)

• A serial (non-parallel) computer


• Single Instruction: Only one instruction stream is being acted on by the CPU during any one
clock cycle
• Single Data: Only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation mainframes, minicomputers, workstations and single
processor/core PCs
Concepts and Terminology
Single Instruction, Multiple Data (SIMD)

A type of parallel computer


• Single Instruction: All processing units execute the same instruction at any given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image
processing.
Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10

• Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD
instructions and execution units.
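The idea is easy to see in code. Below is a minimal sketch (an assumed example, not from the slides) for an x86 processor with SSE, using the <immintrin.h> intrinsics: a single SSE instruction performs four floating-point additions at once.

/* Minimal SIMD sketch (assumed example, not from the slides).
 * Requires an x86 CPU with SSE; compile with: gcc -msse simd_add.c
 * One SSE instruction adds four floats at a time -- "single instruction,
 * multiple data" in miniature. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);             /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}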
Concepts and Terminology
Multiple Instruction, Single Data (MISD)

• A type of parallel computer


• Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• Few (if any) actual examples of this class of parallel computer have ever existed.
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single coded message.
Concepts and Terminology
Multiple Instruction, Multiple Data (MIMD)

• A type of parallel computer


• Multiple Instruction: Every processor may be executing a different instruction stream
• Multiple Data: Every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Currently, the most common type of parallel computer - most modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP
computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components
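As an illustrative sketch (an assumed example, not from the slides), the POSIX-threads program below is MIMD at the thread level: two threads execute different instruction streams on different data at the same time.

/* Minimal MIMD sketch (assumed example): two POSIX threads run different
 * code on different data concurrently. Compile: gcc mimd_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *sum_ints(void *arg) {          /* instruction stream 1 */
    int *v = (int *)arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *scale_floats(void *arg) {      /* instruction stream 2 */
    float *v = (float *)arg;
    for (int i = 0; i < 4; i++) v[i] *= 2.0f;
    printf("scaled first element = %.1f\n", v[0]);
    return NULL;
}

int main(void) {
    int   ints[4]   = {1, 2, 3, 4};
    float floats[4] = {1.5f, 2.5f, 3.5f, 4.5f};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_ints, ints);        /* different code ... */
    pthread_create(&t2, NULL, scale_floats, floats);  /* ... different data */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}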
GOALS of Parallel Computing

• Reduce wall clock time


• Cheapest possible solution strategy – many workstations cost much less than a
large mainframe-style mega-beast
• Locality – processors should be “close” to one another in terms of
communication; the more communication latency incurred, the longer the application
will run
• Memory constraints – memory is one of the resources that is exhausted quickly
Communication
Parallel tasks typically need to exchange data.
There are several ways this can be accomplished, such as through a shared memory bus
or over a network; however, the actual event of data exchange is commonly referred to
as communication regardless of the method employed.
Shared memory describes a computer architecture where all processors have direct
(usually bus-based) access to common physical memory.

Synchronization
The coordination of parallel tasks in real time, very often associated with
communications.
Often implemented by establishing a synchronization point within an application
where a task may not proceed further until another task (or tasks) reaches the same or
logically equivalent point.
Synchronization usually involves waiting by at least one task, and can therefore cause a
parallel application's wall-clock execution time to increase, as the barrier sketch below illustrates.
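A minimal sketch of such a synchronization point, assuming OpenMP (which Module 2 introduces): no thread passes the barrier until every thread has reached it, so some threads sit idle and wall-clock time grows.

/* Synchronization-point sketch using an OpenMP barrier (assumed example,
 * not from the slides). Compile with: gcc -fopenmp barrier.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: finished phase 1\n", id);

        /* no thread proceeds past this point until all threads arrive;
           waiting threads sit idle, which adds to wall-clock time */
        #pragma omp barrier

        printf("thread %d: starting phase 2\n", id);
    }
    return 0;
}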
Some important terminologies

Node - A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores,


memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer
Task - A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor
Pipelining - Breaking a task into steps performed by different processor units, with inputs
streaming through, much like an assembly line; a type of parallel computing
• Granularity - In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication.
• Coarse: relatively large amounts of computational work are done between communication events
• Fine: relatively small amounts of computational work are done between communication events

Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a


proportionate increase in parallel speedup with the addition of more resources
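As a quick aid for reading the scalability definition, the standard textbook formulas (not taken from these slides) are: with $T_1$ the serial run time and $T_p$ the run time on $p$ processors,

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
\]

A scalable system keeps the efficiency $E(p)$ roughly constant as resources are added, i.e. the speedup $S(p)$ grows roughly in proportion to $p$.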
Multicore Processors

Multicore Processors - Introduction
A processor core is a processing unit that reads instructions and performs specific operations

A multi-core processor is a single computing component with two or more independent processing
units called cores, which read and execute program instructions

A many-core processor is a system with hundreds or thousands of cores and implements parallel
architecture
Core types:
1. Homogeneous (symmetric) cores
2. Heterogeneous (asymmetric) cores – a mix of core types, which may run different operating systems

Single Core CPU

Why Multi-core?
Performance of a processor depends on
• No. of instructions executed
• Clock cycles per instruction
• Clock frequency / period

To achieve the above, we need the following:

• Instruction Level Parallelism
– Pipelining

• Thread Level Parallelism
– Multithreading

A multi-core processor has the above features (see the sketch below)
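A minimal thread-level-parallelism sketch (an assumed example using OpenMP, not from the slides): the loop iterations are spread across the cores of a multi-core CPU, and a reduction combines the partial sums.

/* Thread-level-parallelism sketch (assumed example): OpenMP spreads the
 * loop iterations across the cores of a multi-core CPU.
 * Compile with: gcc -fopenmp tlp_sum.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* each core gets a chunk of iterations; the reduction combines partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}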

Multi-core architecture

About Multi-core
• Intel released its first multi-core processor in 2005
• MIMD
• Uses either a homogeneous or a heterogeneous configuration

Various generations:
• Dual core – e.g. Intel Core Duo
• Quad core – e.g. Intel Core i5 processor
• Hexa core – e.g. Intel Core i7 Extreme Edition 980X processor
• Octa core – e.g. Intel Xeon E7-2820 processor
• Deca core – e.g. Intel Xeon E7-2850 processor
Example: Quad-core processor

Homogeneous Multicore Processor
Heterogeneous Multicore Processor
Pro’s Cons

• Energy Efficiency • Shared resources

• True concurrency • Interference

• Performance • Concurrency defects

• Isolation • Non-determinism

• Reliability and Robustness • Analysis difficulty

• Obsolescence Avoidance
• Hardware cost
Shared vs Distributed Memory
Shared Memory Architecture

• A multiprocessor is a multiple-CPU computer with a shared memory.


• The ability for all processors to access all memory as global address space
• Multiple processors can operate independently but share the same memory resources
• Changes in a memory location effected by one processor are visible to all other processors
• Shared memory machines have been classified as UMA and NUMA, based upon memory
access times
Uniform Memory Access (UMA)
Fig. Shared memory (UMA)

• Most commonly represented today by Symmetric Multiprocessor (SMP) machines


• Identical processors
• All processors have equal memory access time

• The presence of large, efficient instruction and data caches reduces the load a single processor puts
on the memory bus and memory, allowing these resources to be shared among multiple
processors.

• Private data are data items used only by a single processor, while shared data are data values used
by multiple processors
Uniform Memory Access (UMA)

• Cache coherence problem


• Solution: write-invalidate / write-update (snooping protocol types)
• Processor synchronization – mutual exclusion / barrier synchronization (see the sketch below)

Fig. Cache coherence problem
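A minimal mutual-exclusion sketch (an assumed example using OpenMP, not from the slides): without the critical section the threads race on the shared counter; with it, only one thread updates the shared memory location at a time.

/* Mutual-exclusion sketch on a shared-memory machine (assumed example).
 * Compile with: gcc -fopenmp mutex_demo.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    long counter = 0;                 /* shared data in the global address space */

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        #pragma omp critical          /* mutual exclusion: one thread at a time */
        counter++;
    }

    printf("counter = %ld\n", counter);   /* always 100000 */
    return 0;
}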


Non Uniform Memory Access (NUMA)

• The shared memory is physically distributed among all the processors, called local memories

• The collection of all local memories forms a global address space which can be accessed by all the
processors.

• Access time varies with the location of the memory word –


• Most memory access are local and fast
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
Shared Memory Architecture
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:
• Lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase
traffic on the shared memory-CPU path and, for cache-coherent systems, geometrically increase
traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global
memory.
Distributed Memory Architecture
• Processors have their own local memory.
• Memory addresses in one processor do not map to another processor, so there is no concept of global
address space across all processors.
• Changes a processor makes to its local memory have no effect on the memory of other processors. Hence, the
concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to
explicitly define how and when data is communicated (see the message-passing sketch below).
• Synchronization between tasks is likewise the programmer's responsibility.

Advantages:
Memory is scalable with the number of processors.
Increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the overhead incurred
with trying to maintain global cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
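A minimal message-passing sketch (an assumed example; MPI programming is covered in Module 4): each process owns its local memory, so rank 0 must explicitly send the value for rank 1 to see it.

/* Message-passing sketch for distributed memory (assumed example).
 * Compile/run: mpicc mp_demo.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d over the network\n", value);
    }

    MPI_Finalize();
    return 0;
}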
Distributed Memory Architecture

Disadvantages:
The programmer is responsible for many of the details associated with data communication
between processors.
It may be difficult to map existing data structures, based on global memory, to this memory
organization.
Non-uniform memory access times - data residing on a remote node takes longer to access than
node local data.
Distributed Shared Memory Architecture

The largest and fastest computers in the world today employ both shared and distributed memory
architectures.

The shared memory component can be a shared memory machine and/or graphics processing
units (GPU).

The distributed memory component is the networking of multiple shared memory/GPU machines,
which know only about their own memory - not the memory on another machine. Therefore,
network communications are required to move data from one machine to another.

Advantage: Scalability
Disadvantage: Programming complexity
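A minimal hybrid sketch (an assumed example, not from the slides): MPI handles message passing between nodes (the distributed memory component), while OpenMP threads share memory within each node (the shared memory component).

/* Hybrid distributed/shared memory sketch (assumed example).
 * Compile/run: mpicc -fopenmp hybrid.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel            /* shared-memory parallelism inside one node */
    printf("node (rank) %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();                 /* message passing handles data movement between nodes */
    return 0;
}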
