
CSE4001

Parallel and Distributed Computing


Dr. N. Narayanan Prasanth
Associate Professor
Department of Database Systems
School of Computer Science and Engineering (SCOPE)
VIT University Vellore
Mail: n.Prasanth@vit.ac.in
Cabin: SJT116 A40
Objectives
1. To introduce the fundamentals of parallel and distributed computing including
parallel and distributed architectures and paradigms
2. To understand the technologies, system architecture, and communication
architecture that propelled the growth of parallel and distributed computing
systems
3. To develop and execute basic parallel and distributed applications using basic
programming models and tools.
Course Outcomes
1. Identify and describe the physical organization of parallel platforms and remember the cost metrics that can be used for
algorithm design.
2. Demonstrate knowledge of the core architectural aspects of Parallel Computing.
3. Indicate key factors that contribute to efficient parallel algorithms and apply a suite of techniques that can be applied
across a wide range of applications.
4. Distinguish the architectural and fundamental models that underpin distributed systems and identify challenges in the
construction of distributed systems.
5. Analyze the issues in synchronizing clocks in distributed systems, apply suitable algorithms to solve them effectively, and
examine how election algorithms can be implemented in a distributed system.
6. Discuss how locking, timestamp ordering and optimistic concurrency control may be extended for use with distributed
transactions.
7. Apply the concepts behind software architectures for large scale Web based systems as well as to design, recognize and
evaluate software architectures.
8. Design and build application programs on distributed Systems.
9. Analyze and implement ideas from current distributed computing research literature.
SYLLABUS

Module 1: Parallelism Fundamentals


Motivation – Key Concepts and Challenges – Overview of Parallel computing – Flynn’s Taxonomy – Multi-Core
Processors – Shared vs Distributed memory.

Module 2: Parallel Architectures


Introduction to OpenMP Programming – Instruction Level Support for Parallel Programming – SIMD – Vector
Processing – GPUs

Module 3: Parallel Algorithm and Design


Preliminaries – Decomposition Techniques – Characteristics of Tasks and Interactions – Mapping Techniques for
Load balancing – Parallel Algorithm Design

Module 4: Introduction to Distributed Systems


Introduction – Characterization of Distributed Systems – Distributed Shared Memory – Message Passing –
Programming Using the Message Passing Paradigm – Group Communication – Case Study (RPC and Java RMI).

Module 5: Coordination
Time and Global States – Synchronizing Physical Clocks – Logical Time and Logical Clock – Coordination and
Agreement – Distributed Mutual Exclusion – Election Algorithms – Consensus and Related Problems
Module 6: Distributed Transactions
Transactions and Concurrency Control – Nested Transactions – Locks – Optimistic Concurrency Control –
Timestamp Ordering – Distributed Transactions: Flat and Nested – Atomic Commit – Two-Phase Commit Protocol –
Concurrency Control
Module 7: Distributed System Architecture & its Variants
Distributed File System: Architecture – Processes – Communication. Distributed Web-based System:
Architecture – Processes – Communication. Overview of Distributed Computing Platforms.
Module 8: Recent Trends

Text Books
1. Grama, A. (2013). Introduction to parallel computing. Harlow, England: Addison-Wesley.
2. Coulouris, G. (2012). Distributed systems. Boston: Addison-Wesley.
Reference Books
1. Hwang, K., & Briggs, F. (1990). Computer architecture and parallel processing (1st ed.). New York: McGraw-Hill.
2. Sinha, P. K. (2007). Distributed operating systems: Concepts and design. PHI Learning Pvt. Ltd.
3. Hager, G. (2017). Introduction to high performance computing for scientists and engineers. CRC Press.
4. Tanenbaum, A. (2007). Distributed systems: Principles and paradigms (2nd ed.). Pearson.
“Parallelism is the future of computing”
Motivation

• The Computational Power Argument – from Transistors to FLOPS


Gordon Moore formulated Moore's law:
“It is the observation that the number of transistors in a dense integrated circuit doubles
about every two years.”

• The Memory/Disk Speed Argument


• The growing gap between processor speed and memory/disk speed presents a tremendous performance bottleneck

• The Data Communication Argument


Parallel Computing
Parallel computing is the use of two or more processors in combination to solve a single
problem.
It targets problems that require large amounts of computation, memory, or both.

The compute resources may include:

• A single computer with multiple processors


• A single computer with multiple processors and some specialized compute resources like GPU,
FPGA, etc.
• An arbitrary number of computers connected by a network
• A combination of the above
Overview: Serial vs Parallel Computing
Limitations in Serial Computing

• Transmission speed

• Limits to miniaturization – processor technology eventually reaches physical limits on how small components can be made

• Economic limitations – expensive to make single processor faster

• Building ever-faster serial computers is increasingly impractical


Parallel computing is the simultaneous use of multiple compute resources to solve a
computational problem:
Parallel Computers
• A parallel computer is a set of processors that are able to work cooperatively to solve a
computational problem.
• Applications such as video conferencing, collaborative work environments, computer-aided
diagnosis in medicine, and virtual reality require a computer able to process large amounts of
data quickly.
• Networks connect multiple stand-alone computers (nodes) to make larger parallel computer
clusters.

Fig. Compute Clusters


Fig. IBM BG/Q compute chip with 18 cores (PU) and 16 L2 cache units (L2)
Example: Typical Lawrence Livermore National Laboratory (LLNL) parallel
computer cluster

• Each compute node is a multi-processor parallel computer in itself


• Multiple compute nodes are networked together with an InfiniBand network
Need for Parallel Computing

The primary reasons for using parallel computing:

Solve larger problems


Save time
Provide concurrency
Effective utilization of non-local resources
Cost savings
Overcoming memory constraints
Scope of Parallel Programming

Parallel computing has made a tremendous impact on a variety of areas ranging from computational
simulations for scientific and engineering applications to commercial applications in data mining and
transaction processing.

Applications in Engineering and Design


design of airfoils, internal combustion engines, high-speed circuits, etc.

Scientific Applications
Functional and structural characterization of genes and proteins, sequencing of the human genome,
analysing biological sequences with a view to developing new drugs, etc.

Commercial Applications
cost-effective servers, Linux clusters, etc.

Applications in Computer Systems


Cryptography, distributed control algorithms for embedded systems, etc.
Challenges

1. Architectural changes (single-core to multi-core to many-core)


2. Need for optimal algorithms
3. Seeking Concurrency
4. Achieving parallel speedup
5. Granularity of partitioning among processors (proper scheduling)
Communication cost and load-balancing cost
6. Locality of computation and communication
Communication between processors, or between processors and memory
7. Resource allocation
8. Data access, communication and synchronization
9. Performance and Scalability
10. Error recovery and support for fault tolerance
Architectural Trends

• Architecture translates technology’s gifts to performance and capability


• Resolves the tradeoff between parallelism and locality
• Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances

• Understanding microprocessor architectural trends


• Helps build intuition about design issues of parallel machines
• Shows fundamental role of parallelism even in “sequential” computers

• Four generations of architectural history: Vacuum tube, transistor, IC, VLSI


Architectural Trends

• Greatest trend in VLSI generation is increase in parallelism


• Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
– slows after 32-bit
– adoption of 64-bit almost complete, 128-bit far off (not a performance issue)
– great inflection point when a 32-bit micro and cache fit on a chip
• Mid 80s to mid 90s: instruction level parallelism
– pipelining and simple instruction sets, + compiler advances (RISC)
– on-chip caches and functional units => superscalar execution
– greater sophistication: out of order execution, speculation, prediction
» to deal with control transfer and latency problems
• thread level parallelism - Multithreading
Concepts
Flynn's Classical Taxonomy - different ways to classify parallel computers
• distinguishes multi-processor computer architectures according to how they can be classified
along the two independent dimensions of Instruction Stream and Data Stream.
• Each of these dimensions can have only one of two possible states: Single or Multiple
Concepts and Terminology
Single Instruction, Single Data (SISD)

• A serial (non-parallel) computer


• Single Instruction: Only one instruction stream is being acted on by the CPU during any one
clock cycle
• Single Data: Only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation mainframes, minicomputers, workstations and single
processor/core PCs
Concepts and Terminology
Single Instruction, Multiple Data (SIMD)

A type of parallel computer


• Single Instruction: All processing units execute the same instruction at any given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image
processing.
Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10

• Most modern computers, particularly those with graphics processor units (GPUs) employ SIMD
instructions and execution units.
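The idea is easy to see in code. Below is a minimal sketch (an assumed example, not from the slides) for an x86 processor with SSE, using the <immintrin.h> intrinsics: a single SSE instruction performs four floating-point additions at once.

/* Minimal SIMD sketch (assumed example, not from the slides).
 * Requires an x86 CPU with SSE; compile with: gcc -msse simd_add.c
 * One SSE instruction adds four floats at a time -- "single instruction,
 * multiple data" in miniature. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);             /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}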
Concepts and Terminology
Multiple Instruction, Single Data (MISD)

• A type of parallel computer


• Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• Few (if any) actual examples of this class of parallel computer have ever existed.
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single coded message.
Concepts and Terminology
Multiple Instruction, Multiple Data (MIMD)

• A type of parallel computer


• Multiple Instruction: Every processor may be executing a different instruction stream
• Multiple Data: Every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Currently, the most common type of parallel computer - most modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP
computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components
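As an illustrative sketch (an assumed example, not from the slides), the POSIX-threads program below is MIMD at the thread level: two threads execute different instruction streams on different data at the same time.

/* Minimal MIMD sketch (assumed example): two POSIX threads run different
 * code on different data concurrently. Compile: gcc mimd_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *sum_ints(void *arg) {          /* instruction stream 1 */
    int *v = (int *)arg, s = 0;
    for (int i = 0; i < 4; i++) s += v[i];
    printf("sum = %d\n", s);
    return NULL;
}

static void *scale_floats(void *arg) {      /* instruction stream 2 */
    float *v = (float *)arg;
    for (int i = 0; i < 4; i++) v[i] *= 2.0f;
    printf("scaled first element = %.1f\n", v[0]);
    return NULL;
}

int main(void) {
    int   ints[4]   = {1, 2, 3, 4};
    float floats[4] = {1.5f, 2.5f, 3.5f, 4.5f};
    pthread_t t1, t2;

    pthread_create(&t1, NULL, sum_ints, ints);        /* different code ... */
    pthread_create(&t2, NULL, scale_floats, floats);  /* ... different data */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}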
GOALS of Parallel Computing

• Reduce wall clock time


• Cheapest possible solution strategy – many workstations cost much less than a
large mainframe-style mega-beast
• Locality – processors should be “close” to one another in terms of
communication; the more communication latency incurred, the longer the application
will run
• Memory constraints – memory is one of the resources that is exhausted quickly
Communication
Parallel tasks typically need to exchange data.
There are several ways this can be accomplished, such as through a shared memory bus
or over a network; however, the actual event of data exchange is commonly referred to
as communication regardless of the method employed.
Shared memory describes a computer architecture where all processors have direct
(usually bus-based) access to common physical memory.

Synchronization
The coordination of parallel tasks in real time, very often associated with
communications.
Often implemented by establishing a synchronization point within an application
where a task may not proceed further until another task (or tasks) reaches the same or
logically equivalent point.
Synchronization usually involves waiting by at least one task, and can therefore cause a
parallel application's wall-clock execution time to increase, as the barrier sketch below illustrates.
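A minimal sketch of such a synchronization point, assuming OpenMP (which Module 2 introduces): no thread passes the barrier until every thread has reached it, so some threads sit idle and wall-clock time grows.

/* Synchronization-point sketch using an OpenMP barrier (assumed example,
 * not from the slides). Compile with: gcc -fopenmp barrier.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: finished phase 1\n", id);

        /* no thread proceeds past this point until all threads arrive;
           waiting threads sit idle, which adds to wall-clock time */
        #pragma omp barrier

        printf("thread %d: starting phase 2\n", id);
    }
    return 0;
}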
Some important terminologies

Node - A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores,


memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer
Task - A logically discrete section of computational work. A task is typically a program or
program-like set of instructions that is executed by a processor
Pipelining - Breaking a task into steps performed by different processor units, with inputs
streaming through, much like an assembly line; a type of parallel computing
• Granularity - In parallel computing, granularity is a qualitative measure of the ratio of
computation to communication.
• Coarse: relatively large amounts of computational work are done between communication events
• Fine: relatively small amounts of computational work are done between communication events

Scalability - Refers to a parallel system's (hardware and/or software) ability to demonstrate a


proportionate increase in parallel speedup with the addition of more resources
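As a quick aid for reading the scalability definition, the standard textbook formulas (not taken from these slides) are: with $T_1$ the serial run time and $T_p$ the run time on $p$ processors,

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
\]

A scalable system keeps the efficiency $E(p)$ roughly constant as resources are added, i.e. the speedup $S(p)$ grows roughly in proportion to $p$.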
Multicore Processors

Multicore Processors - Introduction
A processor core is a processing unit that reads instructions and performs specific operations

A multi-core processor is a single computing component with two or more independent processing
units called cores, which read and execute program instructions

A many-core processor is a system with hundreds or thousands of cores and implements parallel
architecture
Core types:
1. Homogeneous (symmetric) cores
2. Heterogeneous (asymmetric) cores – a mix of core types, which may run different operating systems

Single Core CPU

Why Multi-core?
Performance of a processor depends on
• No. of instructions executed
• Clock cycles per instruction
• Clock frequency / period

To achieve the above, we need the following:

• Instruction Level Parallelism
– Pipelining

• Thread Level Parallelism
– Multithreading

A multi-core processor has the above features (see the sketch below)
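A minimal thread-level-parallelism sketch (an assumed example using OpenMP, not from the slides): the loop iterations are spread across the cores of a multi-core CPU, and a reduction combines the partial sums.

/* Thread-level-parallelism sketch (assumed example): OpenMP spreads the
 * loop iterations across the cores of a multi-core CPU.
 * Compile with: gcc -fopenmp tlp_sum.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* each core gets a chunk of iterations; the reduction combines partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}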

Multi-core architecture

About Multi-core
• Intel released its first multi-core processor in 2005
• MIMD
• Uses either a homogeneous or a heterogeneous configuration

Various generations:
• Dual core – e.g. Intel Core Duo
• Quad core – e.g. Intel Core i5 processor
• Hexa core – e.g. Intel Core i7 Extreme Edition 980X processor
• Octa core – e.g. Intel Xeon E7-2820 processor
• Deca core – e.g. Intel Xeon E7-2850 processor
Example: Quad-core processor

Homogeneous Multicore Processor
Heterogeneous Multicore Processor
Pro’s Cons

• Energy Efficiency • Shared resources

• True concurrency • Interference

• Performance • Concurrency defects

• Isolation • Non-determinism

• Reliability and Robustness • Analysis difficulty

• Obsolescence Avoidance
• Hardware cost
Shared vs Distributed Memory
Shared Memory Architecture

• A multiprocessor is a multiple-CPU computer with a shared memory.


• The ability for all processors to access all memory as global address space
• Multiple processors can operate independently but share the same memory resources
• Changes in a memory location effected by one processor are visible to all other processors
• Shared memory machines have been classified as UMA and NUMA, based upon memory
access times
Uniform Memory Access (UMA)
Fig. Shared memory (UMA)

• Most commonly represented today by Symmetric Multiprocessor (SMP) machines


• Identical processors
• All processors have equal memory access time

• The presence of large, efficient instruction and data caches reduces the load a single processor puts
on the memory bus and memory, allowing these resources to be shared among multiple
processors.

• Private data are data items used only by a single processor, while shared data are data values used
by multiple processors
Uniform Memory Access (UMA)

• Cache coherence problem


• Solution: write-invalidate / write-update (snooping protocol types)
• Processor synchronization – mutual exclusion / barrier synchronization (see the sketch below)

Fig. Cache coherence problem
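A minimal mutual-exclusion sketch (an assumed example using OpenMP, not from the slides): without the critical section the threads race on the shared counter; with it, only one thread updates the shared memory location at a time.

/* Mutual-exclusion sketch on a shared-memory machine (assumed example).
 * Compile with: gcc -fopenmp mutex_demo.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    long counter = 0;                 /* shared data in the global address space */

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        #pragma omp critical          /* mutual exclusion: one thread at a time */
        counter++;
    }

    printf("counter = %ld\n", counter);   /* always 100000 */
    return 0;
}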


Non Uniform Memory Access (NUMA)

• The shared memory is physically distributed among all the processors, called local memories

• The collection of all local memories forms a global address space which can be accessed by all the
processors.

• Access time varies with the location of the memory word –


• Most memory access are local and fast
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
Shared Memory Architecture
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:
• Lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase
traffic on the shared memory-CPU path and, for cache-coherent systems, geometrically increase
traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure "correct" access of global
memory.
Distributed Memory Architecture
• Processors have their own local memory.
• Memory addresses in one processor do not map to another processor, so there is no concept of global
address space across all processors.
• Changes a processor makes to its local memory have no effect on the memory of other processors. Hence, the
concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to
explicitly define how and when data is communicated (see the message-passing sketch below).
• Synchronization between tasks is likewise the programmer's responsibility.

Advantages:
Memory is scalable with the number of processors.
Increase the number of processors and the size of memory increases proportionately.
Each processor can rapidly access its own memory without interference and without the overhead incurred
with trying to maintain global cache coherency.
Cost effectiveness: can use commodity, off-the-shelf processors and networking.
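A minimal message-passing sketch (an assumed example; MPI programming is covered in Module 4): each process owns its local memory, so rank 0 must explicitly send the value for rank 1 to see it.

/* Message-passing sketch for distributed memory (assumed example).
 * Compile/run: mpicc mp_demo.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d over the network\n", value);
    }

    MPI_Finalize();
    return 0;
}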
Distributed Memory Architecture

Disadvantages:
The programmer is responsible for many of the details associated with data communication
between processors.
It may be difficult to map existing data structures, based on global memory, to this memory
organization.
Non-uniform memory access times - data residing on a remote node takes longer to access than
node local data.
Distributed Shared Memory Architecture

The largest and fastest computers in the world today employ both shared and distributed memory
architectures.

The shared memory component can be a shared memory machine and/or graphics processing
units (GPU).

The distributed memory component is the networking of multiple shared memory/GPU machines,
which know only about their own memory - not the memory on another machine. Therefore,
network communications are required to move data from one machine to another.

Advantage: Scalability
Disadvantage: Programming complexity
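A minimal hybrid sketch (an assumed example, not from the slides): MPI handles message passing between nodes (the distributed memory component), while OpenMP threads share memory within each node (the shared memory component).

/* Hybrid distributed/shared memory sketch (assumed example).
 * Compile/run: mpicc -fopenmp hybrid.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel            /* shared-memory parallelism inside one node */
    printf("node (rank) %d, thread %d\n", rank, omp_get_thread_num());

    MPI_Finalize();                 /* message passing handles data movement between nodes */
    return 0;
}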
