
HAWASSA UNIVERSITY

INSTITUTE OF TECHNOLOGY
FACULTY OF INFORMATICS
COMPUTER SCIENCE DEPARTMENT

PARALLEL PROCESSING ASSIGNMENT

Introduction
1. What is Parallel Computing?
Parallel computing is a type of computation in which many calculations or the execution of processes are
carried out simultaneously. Large problems can often be divided into smaller ones, which can then be
solved at the same time. There are several different forms of parallel computing: bit-level, instruction-
level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but
it's gaining broader interest due to the physical constraints preventing frequency scaling. As power
consumption (and consequently heat generation) by computers has become a concern in recent years,
parallel computing has become the dominant paradigm in computer architecture, mainly in the form of
multi-core processors.
Parallel computing is closely related to concurrent computing—they are frequently used together, and
often conflated, though the two are distinct: it is possible to have parallelism without concurrency (such
as bit-level parallelism), and concurrency without parallelism (such as multitasking by time-sharing on a
single-core CPU). In parallel computing, a computational task is typically broken down into several, often
many, very similar sub-tasks that can be processed independently and whose results are combined
afterwards, upon completion. In contrast, in concurrent computing, the various processes often do not
address related tasks; when they do, as is typical in distributed computing, the separate tasks may have a
varied nature and often require some inter-process communication during execution.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements within a
single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Specialized parallel computer architectures are sometimes used alongside traditional processors, for
accelerating specific tasks.

Parallel Computing: Background


Parallel computing is the Computer Science discipline that deals with the system architecture and
software issues related to the concurrent execution of applications. It has been an area of active research
interest and application for decades, mainly as the focus of high-performance computing, but is now
emerging as the prevalent computing paradigm due to the semiconductor industry's shift to multicore
processors.
A Brief History of Parallel Computing
The interest in parallel computing dates back to the late 1950s, with advancements surfacing in the form
of supercomputers throughout the 1960s and 1970s. These were shared-memory multiprocessors, with
multiple processors working side by side on shared data. In the mid-1980s, a new kind of parallel
computing was launched when the Caltech Concurrent Computation project built a supercomputer for
scientific applications from 64 Intel 8086/8087 processors. This system showed that extreme
performance could be achieved with mass market, off the shelf microprocessors. These massively parallel
processors (MPPs) came to dominate the top end of computing, with the ASCI Red supercomputer
in 1997 breaking the barrier of one trillion floating point operations per second. Since then,
MPPs have continued to grow in size and power.
Starting in the late 1980s, clusters came to compete with and eventually displace MPPs for many applications.
A cluster is a type of parallel computer built from large numbers of off-the-shelf computers connected by
an off-the-shelf network. Today, clusters are the workhorse of scientific computing and are the dominant
architecture in the data centers that power the modern information age.

Today, parallel computing is becoming mainstream based on multi-core processors. Most desktop and
laptop systems now ship with dual-core microprocessors, with quad-core processors readily available.
Chip manufacturers have begun to increase overall processing performance by adding additional CPU
cores. The reason is that increasing performance through parallel processing can be far more energy-
efficient than increasing microprocessor clock frequencies. In a world that is increasingly mobile and
energy-conscious, this has become essential. Fortunately, the continued transistor scaling predicted by
Moore’s Law will allow for a transition from a few cores to many.

Parallel Software
The software world has been a very active part of the evolution of parallel computing. Parallel programs
have always been harder to write than sequential ones. A program that is divided into multiple concurrent tasks is
more difficult to write, due to the necessary synchronization and communication that needs to take place
between those tasks. Some standards have emerged. For MPPs and clusters, a number of application
programming interfaces converged to a single standard called MPI by the mid-1990s. For shared
memory multiprocessor computing, a similar process unfolded with convergence around two standards by
the mid to late 1990s: pthreads and OpenMP. In addition to these, a multitude of competing parallel
programming models and languages have emerged over the years. Some of these models and languages
may provide a better solution to the parallel programming problem than the above “standards”, all of
which are modifications to conventional, non-parallel languages like C.
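As a concrete illustration of the shared-memory standards mentioned above, here is a minimal OpenMP sketch in C; the array size and the parallelized loop are illustrative choices, not taken from this text:

```c
#include <stdio.h>
#include <omp.h>   /* OpenMP: compile with -fopenmp (gcc) or equivalent */

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The loop iterations are divided among threads; the reduction
       clause combines the per-thread partial sums afterwards. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f (threads available: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}
```

An equivalent MPI or pthreads version would distribute the same loop explicitly; OpenMP keeps the change to a single directive on an otherwise sequential program, which is typical of how these standards retrofit parallelism onto conventional languages like C.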
As multi-core processors bring parallel computing to mainstream customers, the key challenge in
computing today is to transition the software industry to parallel programming. The long history of
parallel software has not revealed any “silver bullets,” and indicates that there will not likely be any
single technology that will make parallel software ubiquitous. Doing so will require broad collaborations
across industry and academia to create families of technologies that work together to bring the power of
parallel computing to future mainstream applications. The changes needed will affect the entire industry,
from consumers to hardware manufacturers and from the entire software development infrastructure to
application developers who rely upon it.
Future capabilities such as photorealistic graphics, computational perception, and machine learning rely
heavily on highly parallel algorithms. Enabling these capabilities will advance a new generation of
experiences that expand the scope and efficiency of what users can accomplish in their digital lifestyles
and workplace. These experiences include more natural, immersive, and increasingly multi-sensory
interactions that offer multi-dimensional richness and context awareness. The future for parallel
computing is bright, but with new opportunities come new challenges.
Performance Metrics for Parallel Systems
It is important to study the performance of parallel programs with a view to determining the best algorithm,
evaluating hardware platforms, and examining the benefits from parallelism. A number of metrics have been used
based on the desired outcome of performance analysis.

1 Execution Time
The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a
sequential computer. The parallel runtime is the time that elapses from the moment a parallel computation starts to
the moment the last processing element finishes execution. We denote the serial runtime by TS and the
parallel runtime by TP.

2 Total Parallel Overhead


The overheads incurred by a parallel program are encapsulated into a single expression referred to as the  overhead
function. We define overhead function or total overhead of a parallel system as the total time collectively spent by
all the processing elements over and above that required by the fastest known sequential algorithm for solving the
same problem on a single processing element. We denote the overhead function of a parallel system by the
symbol To.

The total time spent in solving a problem, summed over all processing elements, is pTP. Of this, TS units are spent
performing useful work, and the remainder is overhead. Therefore, the overhead function To is given by

To = pTP - TS        (Equation 1)

3 Speedup
When evaluating a parallel system, we are often interested in knowing how much performance
gain is achieved by parallelizing a given application over a sequential implementation. Speedup
is a measure that captures the relative benefit of solving a problem in parallel. It is defined as the
ratio of the time taken to solve a problem on a single processing element to the time required to
solve the same problem on a parallel computer with p identical processing elements. We denote
speedup by the symbol S.
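In symbols, using the runtimes defined above:

S = TS / TP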
4 Efficiency
Only an ideal parallel system containing p processing elements can deliver a speedup equal to p.
In practice, ideal behavior is not achieved, because while executing a parallel algorithm the
processing elements cannot devote 100% of their time to the computations of the algorithm. Part
of their time is spent idling and, in real systems, communicating; for example, when summing n
numbers in parallel, some processing elements sit idle during the final combining steps.
Efficiency is a measure of the fraction of time for which a
processing element is usefully employed; it is defined as the ratio of speedup to the number of
processing elements. In an ideal parallel system, speedup is equal to p and efficiency is equal to
one. In practice, speedup is less than p and efficiency is between zero and one, depending on the
effectiveness with which the processing elements are utilized. We denote efficiency by the
symbol E.
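In symbols:

E = S / p = TS / (pTP)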

5 Cost
We define the cost of solving a problem on a parallel system as the product of parallel runtime and the number of
processing elements used. Cost reflects the sum of the time that each processing element spends solving the
problem. Efficiency can also be expressed as the ratio of the execution time of the fastest known sequential
algorithm for solving a problem to the cost of solving the same problem on p processing elements.
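In symbols, Cost = pTP, so efficiency can also be written E = TS / (pTP), the ratio of sequential cost to parallel cost.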

The cost of solving a problem on a single processing element is the execution time of the fastest known sequential
algorithm. A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer has the
same asymptotic growth (in Θ terms) as a function of the input size as the fastest-known sequential algorithm on a
single processing element. Since efficiency is the ratio of sequential cost to parallel cost, a cost-optimal parallel
system has an efficiency of Θ(1).

Cost is sometimes referred to as work or processor-time product, and a cost-optimal system is also known as a
pTP-optimal system.

2. Explain Super-Scalar execution with the help of an example.


Superscalar processor
A superscalar processor is a CPU that implements a form of parallelism called instruction-level
parallelism within a single processor. In contrast to a scalar processor that can execute at most
one instruction per clock cycle, a superscalar processor can execute more than one
instruction during a clock cycle by simultaneously dispatching multiple instructions to different
execution units on the processor. It therefore allows for more throughput (the number of
instructions that can be executed in a unit of time) than would otherwise be possible at a given
clock rate. Each execution unit is not a separate processor (or a core if the processor is a multi-
core processor), but an execution resource within a single CPU such as an arithmetic logic unit.
In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor
(Single Instruction stream, Single Data stream), though a single-core superscalar processor that
supports short vector operations could be classified as SIMD (Single Instruction stream, Multiple
Data streams). A multi-core superscalar processor is classified as an MIMD processor (Multiple
Instruction streams, Multiple Data streams).
While a superscalar CPU is typically also pipelined, superscalar and pipelining execution are
considered different performance enhancement techniques. The former executes multiple
instructions in parallel by using multiple execution units, whereas the latter executes multiple
instructions in the same execution unit in parallel by dividing the execution unit into different
phases.
The superscalar technique is traditionally associated with several identifying characteristics
(within a given CPU):
 Instructions are issued from a sequential instruction stream.
 The CPU dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).
 The CPU can execute multiple instructions per clock cycle.
Superscalar execution
The programming model is the processor as it is visible to the programmer. The register files, the
processor status word, the ALU, and the other parts of the programming model are all there to
provide a means for the programmer to manipulate the processor and make it do useful work. In
other words, the programming model is essentially a user interface for the CPU.

Much like the graphical user interfaces on modern computer systems, there's a lot more going on
"under the hood" than the simplicity of the interface would imply. In my article
on multithreading, super threading and hyper-threading, I talked about the various ways in which
the OS and processor collaborate to fool the user into thinking that he or she is executing
multiple programs at once. There's a similar sort of trickery that goes on beneath the
programming model in a modern microprocessor, but it's intended to fool the programmer into
thinking that there's only one thing going on at a time, when really there are multiple things
happening simultaneously. Let me explain.
Back in the days when you could fit only a few transistors on a single die, many of the parts of
the programming model actually fit on separate chips attached to a single circuit board. For
instance, one chip would contain the ALU, another the control unit, another the registers, etc.
Such computers were obviously quite slow, and the fact that they were made of multiple chips
made them expensive. Each chip had its own manufacturing and packaging costs, and then there
was the cost and complexity of putting them all together on a single circuit board, so the fewer
chips you put on a board, the cheaper the overall system was. (Note that this is still true today. The cost of producing
systems and components can be drastically reduced by packing the functionality of multiple
chips into a single chip.)
With the advent of the Intel 4004 in 1971, all of that changed. The 4004 was the world's first
microprocessor on a chip. Designed to be the brains of a calculator manufactured by a now
defunct company named Busicom, the 4004 had sixteen 4-bit registers, an ALU, and decoding and
control logic, all packed onto a single 2,300-transistor chip. The 4004 was quite a feat for its day,
and it paved the way for the PC revolution. However, it wasn't until Intel released the 8080 three
years later that the world saw the first true general-purpose CPU.
During the decades following the 4004, transistor densities increased at a stunning pace. As CPU
designers had more and more transistors to work with when designing new chips, they began to
think up novel ways for using those transistors to increase computing performance on application
code. One of the first things that occurred to designers was that they could put more than one
ALU on a chip and have both ALUs working in parallel to process code faster. Since these designs
could do more than one scalar (or integer, for our purposes) operation at once, they were
called superscalar computers. The RS/6000 from IBM, released in 1990, was the world's
first superscalar CPU. Intel followed in 1993 with the Pentium, which, with its two ALUs,
brought the x86 world into the superscalar era.
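A small C fragment (illustrative, not from the original article) makes the idea concrete: the first two statements below have no data dependency, so a superscalar CPU with two ALUs can dispatch them to separate execution units in the same clock cycle, whereas the third depends on both results and must wait.

```c
#include <stdio.h>

int main(void) {
    int a = 3, b = 4, c = 5, d = 6;

    /* Independent: a superscalar CPU can issue both in the same cycle,
       one to each ALU. */
    int x = a + b;
    int y = c * d;

    /* Dependent on x and y: can only issue after both complete. */
    int z = x + y;

    printf("%d\n", z);
    return 0;
}
```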

3. Differentiate between SIMD and MIMD.


Definition of SIMD
SIMD stands for Single Instruction, Multiple Data streams, a form of parallel
architecture categorized under Flynn's classification. In this architecture, a single instruction is
applied to a group of data streams or distinct data elements at the same time. It has a single
control unit that drives several separate processing units. SIMD supports vector processing,
where the single control unit directs the operation of all the execution units. All the
execution units accept the same instruction from the control unit but operate on separate
elements of data. The shared memory unit is divided into modules so that it can interact with all
the processors simultaneously.
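As a concrete sketch (illustrative, not from the source text), the loop below is the classic SIMD pattern: one operation applied element-wise across a stream of data. A vectorizing compiler, for example gcc with -O3, can compile it into SIMD instructions that process several elements per instruction.

```c
#include <stddef.h>

/* One instruction, many data elements: with auto-vectorization enabled,
   each SIMD add processes several adjacent array elements at once. */
void vector_add(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```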
Definition of MIMD

As the name suggests, MIMD (Multiple Instruction, Multiple Data streams) involves
computers having multiple processing units, instruction streams, and data streams. This
architecture applies multiple instructions to different data simultaneously. MIMD machines are
considered the most complex configuration, but they can also be highly efficient. They provide a
high degree of concurrency: in addition to each processor operating concurrently, multiple
processors execute different instruction streams in the same time frame. MIMD works
efficiently with both shared and distributed memory models.

The main differences between SIMD and MIMD are:

1. SIMD stands for Single Instruction, Multiple Data, while MIMD stands for Multiple Instruction, Multiple Data.
2. SIMD requires less memory, while MIMD requires more memory.
3. SIMD costs less than MIMD.
4. SIMD has a single decoder, while MIMD has multiple decoders.
5. SIMD synchronization is latent (implicit), while MIMD synchronization is accurate (explicit).
6. SIMD is a synchronous programming model, while MIMD is an asynchronous programming model.
7. SIMD is simpler than MIMD in terms of complexity.
8. SIMD is less efficient than MIMD in terms of performance.

Difference between UMA and NUMA
Definition of UMA
UMA (Uniform Memory Access) is a shared memory architecture for multiprocessors. In this
model, a single memory is used and accessed by all the processors present in the multiprocessor
system with the help of an interconnection network. Each processor has equal memory access
time (latency) and access speed. It can employ a single bus, multiple buses, or a crossbar switch.
Because it provides balanced shared memory access, it is also known as an SMP (Symmetric
Multiprocessor) system.
Definition of NUMA
NUMA (Non-Uniform Memory Access) is also a multiprocessor model, in which each processor is
connected to its own dedicated memory. However, these small parts of memory combine to form
a single address space. The key point is that, unlike UMA, memory access time depends on the
distance between the processor and the memory, so access times vary. Any memory location can
be accessed using its physical address.

The main differences between UMA and NUMA are:

1. Definition: UMA stands for Uniform Memory Access, while NUMA stands for Non-Uniform Memory Access.
2. Memory controller: UMA has a single memory controller, while NUMA has multiple memory controllers.
3. Memory access: UMA memory access is slower, while NUMA memory access is faster than UMA's.
4. Bandwidth: UMA has limited bandwidth, while NUMA has more bandwidth than UMA.
5. Suitability: UMA is used in general-purpose and time-sharing applications, while NUMA is used in real-time and time-critical applications.
6. Memory access time: UMA has equal memory access time, while NUMA has varying memory access time.
7. Bus types: UMA supports three bus types (single, multiple, and crossbar), while NUMA supports two (tree and hierarchical).

Difference between Static and Dynamic Networks in parallel systems
Static interconnection networks
Static interconnection networks for elements of parallel systems (e.g., processors, memories) are
based on fixed connections that cannot be modified without physically redesigning the system.
Static interconnection networks can have many structures, such as a linear structure (pipeline), a
matrix, a ring, a torus, a complete connection structure, a tree, a star, or a hypercube.

Dynamic interconnection networks


Dynamic interconnection networks between processors enable the connection structure in a
system to be changed (reconfigured). This can be done before or during parallel program
execution, so we can speak of static or dynamic connection reconfiguration.

Difference between S-RAM & D-RAM


Definition of SRAM
SRAM (Static Random Access Memory) is built with CMOS technology and uses six transistors
per cell. Its construction comprises two cross-coupled inverters that store a data bit, similar to a
flip-flop, plus two extra transistors for access control. It is relatively faster than other RAM types
such as DRAM, and it consumes less power. SRAM can hold data as long as power is supplied
to it.

Definition of DRAM
DRAM (Dynamic Random Access Memory) is also a type of RAM, constructed using capacitors
and a few transistors. The capacitor stores the data: a bit value of 1 signifies that the capacitor is
charged, and a bit value of 0 means that it is discharged.
The capacitor tends to discharge over time, which results in leakage of charge, so DRAM requires
periodic refreshing.

The main differences between SRAM and DRAM are:

1. Cost: SRAM is expensive, while DRAM is cheap.
2. Used in: SRAM is used in cache memory, while DRAM is used in main memory.
3. Density: SRAM is less dense, while DRAM is highly dense.
4. Construction: SRAM is complex, using transistors and latches, while DRAM is simple, using capacitors and very few transistors.
5. Single block of memory: SRAM requires 6 transistors, while DRAM requires only one transistor.
6. Charge leakage: not present in SRAM; present in DRAM, which therefore requires power refresh circuitry.
7. Power consumption: low in SRAM, high in DRAM.


Difference between RISC & CISC processor

RISC Architecture

The term RISC stands for "Reduced Instruction Set Computer". It is a CPU design approach based
on simple instructions that execute quickly.

RISC uses a small, reduced set of instructions, and each instruction is expected to perform a very
small job. The instructions are modest and simple, and they can be combined to carry out more
complex operations. Each instruction is about the same length, so instructions can be strung
together to get compound tasks done. Most instructions are completed in one machine cycle, and
pipelining is a crucial technique used to speed up RISC machines.

CISC Architecture

The term CISC stands for "Complex Instruction Set Computer". It is a CPU design approach based
on single instructions that are capable of executing multi-step operations.

CISC programs are small, because each instruction does a great deal of work, but its many
compound instructions can take a long time to perform. A single instruction is executed in
several steps, and an instruction set may contain more than 300 separate instructions. Most
instructions are completed in two to ten machine cycles. In CISC, instruction pipelining is not
easily implemented.

Assignment 2
Problem 2.1 A 40-MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:

Instruction type      Instruction count   Clock cycle count
Integer arithmetic    45000               1
Data transfer         32000               2
Floating point        15000               2
Control transfer      8000                2

Determine the effective CPI, MIPS rate, and execution time for this program.

Solution
CPI

CPI = Σ (CPIi × Ii) / IC, summed over the n instruction types

Here,
CPI = cycles per instruction
IC = total instruction count = 100000

CPI = ((1 × 45000) + (2 × 32000) + (2 × 15000) + (2 × 8000)) / 100000
    = (45000 + 64000 + 30000 + 16000) / 100000
    = 155000 / 100000
CPI = 1.55

MIPS RATE

MIPS = IC / (T × 10^6)
     = IC / (IC × CPI × (1/f) × 10^6)
     = f / (CPI × 10^6)

MIPS = (40 × 10^6) / (1.55 × 10^6)
     = 25.8064

EXECUTION TIME

T = IC × CPI × (1/f)
  = 100000 × 1.55 × (1 / (40 × 10^6))
  = 0.003875 s
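As a quick cross-check of the arithmetic, here is a small C sketch; the instruction mix comes from the problem statement, while the code itself is ours:

```c
#include <stdio.h>

int main(void) {
    /* Instruction mix from Problem 2.1 */
    long counts[] = {45000, 32000, 15000, 8000};
    int  cycles[] = {1, 2, 2, 2};
    double f = 40e6;                 /* 40 MHz clock */

    long ic = 0, total_cycles = 0;
    for (int i = 0; i < 4; i++) {
        ic += counts[i];
        total_cycles += counts[i] * cycles[i];
    }

    double cpi  = (double)total_cycles / ic;   /* 1.55 */
    double mips = f / (cpi * 1e6);             /* ~25.8064 */
    double t    = ic * cpi / f;                /* 0.003875 s */

    printf("CPI = %.2f, MIPS = %.4f, T = %.6f s\n", cpi, mips, t);
    return 0;
}
```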

Problem 2.2: What is pipelining? Describe the speedup gained due to pipelining.
What are the various factors that affect the throughput of an instruction pipeline?
What is a Pipeline (computing)?
In computing, a pipeline, also known as a data pipeline, is a set of data processing elements
connected in series, where the output of one element is the input of the next one. The elements of
a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage
is often inserted between elements.
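On the speedup gained (a standard result, added here because the question asks for it): for an ideal pipeline with k stages executing n instructions, the unpipelined machine needs n × k stage-times while the pipelined one needs k + (n - 1), giving a speedup of

S = nk / (k + n - 1),

which approaches k as n grows large. In practice, hazards, stalls, branch penalties, and uneven stage delays keep the achieved speedup below this ideal; these are also the main factors that limit the throughput of an instruction pipeline.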
Computer-related pipelines include:
 Instruction pipelines, such as the classic RISC pipeline, which are used in central processing
units (CPUs) and other microprocessors to allow overlapping execution of multiple instructions
with the same circuitry. The circuitry is usually divided up into stages and each stage processes a
specific part of one instruction at a time, passing the partial results to the next stage. Examples of
stages are instruction decode, arithmetic/logic and register fetch. They are related to the
technologies of superscalar execution, operand forwarding, speculative execution and out-of-
order execution.
 Graphics pipelines, found in most graphics processing units (GPUs), which consist of
multiple arithmetic units, or complete CPUs, that implement the various stages of
common rendering operations (perspective projection, window clipping, color and light
calculation, rendering, etc.).
 Software pipelines, which consist of a sequence of computing processes (commands,
program runs, tasks, threads, procedures, etc.), conceptually executed in parallel, with the
output stream of one process being automatically fed as the input stream of the next one.
The Unix system call pipe is a classic example of this concept; a minimal sketch follows this list.
 HTTP pipelining, the technique of issuing multiple HTTP requests through the same TCP
connection, without waiting for the previous one to finish before issuing a new one.
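As a minimal sketch of the Unix pipe mentioned above (a POSIX system is assumed; error handling is trimmed for brevity), two processes form a two-stage software pipeline, with the parent writing into the pipe and the child reading from it:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    pipe(fd);                          /* fd[0] = read end, fd[1] = write end */

    if (fork() == 0) {                 /* child: the consumer stage */
        char buf[64];
        close(fd[1]);
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("stage 2 received: %s\n", buf);
        }
        _exit(0);
    }

    close(fd[0]);                      /* parent: the producer stage */
    const char *msg = "data from stage 1";
    write(fd[1], msg, strlen(msg));
    close(fd[1]);
    wait(NULL);
    return 0;
}
```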

Problem 2.3: What is Cache Mapping? How does the choice of mapping affect the speed of
parallel execution of tasks?

Cache Mapping

 Cache mapping defines how a block from the main memory is mapped to the cache
memory in case of a cache miss.
OR
 Cache mapping is a technique by which the contents of main memory are brought into
the cache memory.
NOTES
 Main memory is divided into equal-size partitions called blocks or frames.
 Cache memory is divided into partitions having the same size as the blocks, called lines.
 During cache mapping, a block of main memory is simply copied to the cache; the block is
not actually removed from the main memory.

Cache Mapping Techniques

Cache mapping is performed using the following three techniques:

1. Direct Mapping
2. Fully Associative Mapping
3. K-way Set Associative Mapping
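To make the first technique concrete, here is a small sketch (the cache size is an illustrative assumption, not from the notes) of how direct mapping assigns a main-memory block to a cache line: the block can only go to line (block number) mod (number of lines), which is what makes direct mapping fast to look up but prone to conflict misses.

```c
#include <stdio.h>

#define CACHE_LINES 8u   /* illustrative number of cache lines */

/* Direct mapping: each main-memory block maps to exactly one line. */
unsigned line_for_block(unsigned block) {
    return block % CACHE_LINES;
}

int main(void) {
    /* Blocks 3, 11, and 19 all collide on line 3; repeated accesses to
       them evict each other, which associative mapping would avoid. */
    unsigned blocks[] = {3, 11, 19};
    for (int i = 0; i < 3; i++)
        printf("block %u -> line %u\n", blocks[i], line_for_block(blocks[i]));
    return 0;
}
```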

