
U21CS603

UNIT I: PARALLELISM FUNDAMENTALS AND ARCHITECTURE


Motivation – Key Concepts and Challenges – Overview of Parallel computing – Flynn’s Taxonomy
– Multi-core processors – Shared vs Distributed memory. Introduction to OpenMP programming –
Instruction Level Support for Parallel Programming – SIMD – Vector Processing – GPUs.

PARALLEL COMPUTING
Parallel computing is a computing architecture in which multiple processors execute an application or computation simultaneously.

1. Motivation

Development of parallel software has traditionally been thought of as time and effort intensive. This can be largely attributed to the inherent complexity of specifying and coordinating concurrent tasks, and to a lack of portable algorithms, standardized environments, and software development toolkits. If it takes two years to develop a parallel application, during which time the underlying hardware and/or software platform becomes obsolete, the development effort is clearly wasted. However, there are some unmistakable trends in hardware design which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.

This is a result of lack of implicit parallelism as well as other bottlenecks such as the datapath and the
memory. At the same time, standardized hardware interfaces have reduced the turnaround time from
the development of a microprocessor to a parallel machine based on the microprocessor.

2. Key Concepts and Challenges

Moore's Law states that circuit complexity doubles every eighteen months. This empirical relationship
has been amazingly resilient over the years both for microprocessors as well as for DRAMs. By relating
component density and increases in die-size to the computing power of a device, Moore's law has been
extrapolated to state that the amount of computing power available at a given cost doubles
approximately every 18 months.
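As a quick back-of-the-envelope illustration of this growth rate (the numbers here are derived from the stated doubling period, not taken from the original text): doubling every 18 months corresponds to a factor of 2^(t/1.5) over t years, which is roughly a 4x increase over three years and about a 16x increase over six years at the same cost.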

➢ Memory/disk Speed Argument: While clock rates of high-end processors have increased at roughly
40% per year over the past decade, DRAM access times have only improved at the rate of
roughly 10% per year over this interval. Coupled with increases in instructions executed per
clock cycle, this gap between processor speed and memory presents a tremendous performance
bottleneck. The overall performance of the memory system is determined by the fraction of the
total memory requests that can be satisfied from the cache (a worked example follows this list).
Parallel platforms typically yield better memory system performance because they provide (i) larger aggregate caches, and (ii)


higher aggregate bandwidth to the memory system (both typically linear in the number of
processors).
➢ Data Communication Argument: As the networking infrastructure evolves, the vision of
using the Internet as one large heterogeneous parallel/distributed computing environment has
begun to take shape. Many applications lend themselves naturally to such computing paradigms.
Some of the most impressive applications of massively parallel computing have been in the
context of wide-area distributed platforms.
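A worked example for the memory speed argument above (the timings are illustrative assumptions, not figures from the text): if a cache hit takes 1 ns, a DRAM access takes 100 ns, and a fraction f of all memory requests hit in the cache, the effective access time is approximately f × 1 ns + (1 − f) × 100 ns. With f = 0.90 this works out to 10.9 ns, while raising the hit fraction to f = 0.99 brings it down to about 2 ns, which is why the larger aggregate caches of parallel platforms improve memory system performance.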

3. Overview of Parallel computing (Scope)

Parallel computing has made a tremendous impact on a variety of areas ranging from computational
simulations for scientific and engineering applications to commercial applications in data mining and
transaction processing. The cost benefits of parallelism coupled with the performance requirements of
applications present compelling arguments in favor of parallel computing. We present a small sample
of the diverse applications of parallel computing.

➢ Applications in Engineering and Design: Parallel computing has traditionally been employed
with great success in the design of airfoils (optimizing lift, drag, stability), internal combustion
engines (optimizing charge distribution, burn), high-speed circuits (layouts for delays and
capacitive and inductive effects), and structures (optimizing structural integrity, design
parameters, cost, etc.), among others. More recently, design of microelectromechanical and
nanoelectromechanical systems (MEMS and NEMS) has attracted significant attention. This
presents formidable challenges for geometric modeling, mathematical modeling, and algorithm
development, all in the context of parallel computers.
➢ Scientific Applications: The past few years have seen a revolution in high performance
scientific computing applications. The sequencing of the human genome by the International
Human Genome Sequencing Consortium and Celera, Inc. has opened exciting new frontiers in
bioinformatics. Functional and structural characterization of genes and proteins hold the
promise of understanding and fundamentally influencing biological processes. Analyzing
biological sequences with a view to developing new drugs and cures for diseases and medical
conditions requires innovative algorithms as well as large-scale computational power.
➢ Applications in Computer Systems: As computer systems become more pervasive and
computation spreads over the network, parallel processing issues become engrained into a
variety of applications. In computer security, intrusion detection is an outstanding challenge. In
the case of network intrusion detection, data is collected at distributed sites and must be
analyzed rapidly for signaling intrusion. The infeasibility of collecting this data at a central
location for analysis requires effective parallel and distributed algorithms. In the area of
cryptography, some of the most spectacular applications of Internet-based parallel computing
have focused on factoring extremely large integers.


4. Flynn’s Taxonomy or Flynn’s classification of computer (GATE topic for CSE):

Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn in 1966


and extended in 1972. The classification system has stuck, and it has been used as a tool in the design
of modern processors and their functionalities. Since the rise of multiprocessing central processing
units (CPUs), a multiprogramming context has evolved as an extension of the classification system.

➢ What is Flynn’s classification of computer?

M. J. Flynn offered a classification for a computer system's organization based on the number of instruction streams and data streams that are handled simultaneously. An instruction stream is the sequence of instructions read from memory. A data stream is the sequence of operations performed on the data in the processor. The term 'stream' refers to the flow of data or instructions. Parallel processing can happen in the data stream, the instruction stream, or both.

Computers can be divided into the following major groups according to Flynn’s Classification:

➢ SISD: SISD is an abbreviation for Single Instruction and Single Data Stream. It depicts the structure of a single computer, which includes a control unit, a memory unit, and a processor unit. This system may or may not have internal parallel processing capability; instructions are therefore performed sequentially. Like classic Von Neumann computers, most conventional computers utilize the SISD architecture. Multiple functional units or pipeline processing can be used to achieve parallel processing in this case. The Control Unit decodes the instructions before sending them to the processing units for execution. The Data Stream is a bi-directional data stream that moves between the memory and processors. Here, PE = Processing Element, CU = Control Unit, and M = Memory. Examples: Minicomputers, workstations, and computers from previous generations.


➢ SIMD: SIMD is an abbreviation for Single Instruction and Multiple Data Stream. It symbolizes an organization with a large number of processing units overseen by a common control unit. The control unit sends the same instruction to all processors, but they work on separate data. To communicate with all of the processors at the same time, the shared memory unit must have numerous modules.
SIMD was created with array processing devices in mind. Vector processors, however, can also be included in this category according to Flynn's taxonomy. There are also architectures that are SIMD architectures but are not vector processors; the Connection Machine and numerous GPUs are two examples of multiple processors executing the same instructions.
Traditionally, larger and variable vector sizes were employed in vector computers like the Cray or the STAR. Superscalar is one of the ways to implement a processor, but unlike the categories in Flynn's taxonomy, it makes no statement about the instruction set.

➢ MISD: MISD is an abbreviation for


Multiple Instruction and Single Data
stream. Because no real system has
been built using the MISD structure, it
is primarily of theoretical importance.
Multiple processing units work on a
single data stream in MISD. Each
processing unit works on the data in its
own way, using its own instruction
stream.
Here, M = Memory Modules, P = Processor Units, and CU = Control Unit
Example: The experimental Carnegie-Mellon computer C.mmp (in 1971)


➢ MIMD: MIMD is an abbreviation for


Multiple Instruction and Multiple Data
Stream. All processors of a parallel computer
may execute distinct instructions and act on
different data at the same time in this
organization. Each processor in MIMD has its
own program, and each program generates an
instruction stream.
Here, PE = Processing Element, M = Memory
Module, and CU = Control Unit
Example: IBM-SP2, Cray T3E, Cray T90

5. Multi-Core Processors

A multicore processor is an integrated circuit that has two or more processor cores attached for
enhanced performance and reduced power consumption. These processors also enable more efficient
simultaneous processing of multiple tasks, such as with parallel processing and multithreading. A dual
core setup is similar to having multiple, separate processors installed on a computer. However, because
the two processors are plugged into the same socket, the connection between them is faster. The use of
multicore processors or microprocessors is one approach to boost processor performance without
exceeding the practical limitations of semiconductor design and fabrication. Using multiple cores also helps keep operation safe in areas such as heat generation.

➢ Multicore processors working concept: The heart of every processor is an execution engine,
also known as a core. The core is designed to process instructions and data according to the
direction of software programs in the computer's memory. Over the years, designers found that
every new processor design had limits. Numerous technologies were developed to accelerate
performance, including the following ones: (ref: https://www.techtarget.com/searchdatacenter/definition/multi-core-
processor)
o Clock speed. One approach was to make the processor's clock faster. The clock is the
"drumbeat" used to synchronize the processing of instructions and data through the
processing engine. Clock speeds have accelerated from several megahertz to several
gigahertz (GHz) today. However, transistors use up power with each clock tick. As a
result, clock speeds have nearly reached their limits given current semiconductor
fabrication and heat management techniques. Figure 1 depicts the architecture of a multicore processor.
o Hyper-threading. Another approach involved the handling of multiple instruction
threads. Intel calls this hyper-threading. With hyper-threading, processor cores are
designed to handle two separate instruction threads at the same time. When properly
enabled and supported by both the computer's firmware and operating system (OS),


hyper-threading techniques enable one physical core to function as two logical cores.
Still, the processor only possesses a single physical core. The logical abstraction of the
physical processor added little real performance to the processor other than to help
streamline the behavior of multiple simultaneous applications running on the computer.
o More chips. The next step was to add processor chips -- or dies -- to the processor
package, which is the physical device that plugs into the motherboard. A dual-core
processor includes two separate processor cores. A quad-core processor includes four
separate cores. Today's multicore processors can easily include 12, 24 or even more
processor cores. The multicore approach is almost identical to the use of multiprocessor
motherboards, which have two or four separate processor sockets. The effect is the same.
Today's huge processor performance involves the use of processor products that
combine fast clock speeds and multiple hyper-threaded cores.

Figure 1. Multicore processors Architecture

However, multicore chips have several issues to consider. First, the addition of more processor cores
doesn't automatically improve computer performance. The OS and applications must direct software
program instructions to recognize and use the multiple cores. This must be done in parallel, by directing
various threads to different cores within the processor package. Some software applications may need
to be refactored to support and use multicore processor platforms. Otherwise, only the default first
processor core is used, and any additional cores are unused or idle.


Second, the performance benefit of additional cores is not a direct multiple. That is, adding a second
core does not double the processor's performance, nor does a quad-core processor multiply the
processor's performance by a factor of four. This happens because of the shared elements of the
processor, such as access to internal memory or caches, external buses and computer system memory.

The benefit of multiple cores can be substantial, but there are practical limits. Still, the acceleration is
typically better than a traditional multiprocessor system because the coupling between cores in the same
package is tighter and there are shorter distances and fewer components between cores.

Consider the analogy of cars on a road. Each car might be a processor, but each car must share the
common roads and traffic limitations. More cars can transport more people and goods in a given time,
but more cars also cause congestion and other problems.

➢ Types of multicore processors: Different multicore processors often have different numbers
of cores. For example, a quad-core processor has four cores. The number of cores is usually a
power of two. (ref: https://insights.sei.cmu.edu/blog/multicore-processing/)
➢ Core types:
o Homogeneous (symmetric) cores. All of the cores in a homogeneous multicore
processor are of the same type; typically, the core processing units are general-purpose
central processing units that run a single multicore operating system.
o Heterogeneous (asymmetric) cores. Heterogeneous multicore processors have a mix
of core types that often run different operating systems and include graphics processing
units.
➢ Number and level of caches. Multicore processors vary in terms of their instruction and data
caches, which are relatively small and fast pools of local memory.
➢ How cores are interconnected. Multicore processors also vary in terms of their bus
architectures.
➢ Isolation. The amount, typically minimal, of in-chip support for the spatial and temporal
isolation of cores:
o Physical isolation ensures that different cores cannot access the same physical hardware
(e.g., memory locations such as caches and RAM).
o Temporal isolation ensures that the execution of software on one core does not impact
the temporal behavior of software running on another core.
➢ Homogeneous Multicore Processor: Figure 2 notionally shows the architecture of a
system in which 14 software applications are allocated by a single host operating system to the
cores in a homogeneous quad-core processor. In this architecture, there are three levels of cache,
which are progressively larger but slower: L1 (consisting of an instruction cache and a data
cache), L2, and L3. Note that the L1 and L2 caches are local to a single core, whereas L3 is
shared among all four cores.


Figure 2. Homogeneous Multicore Processor


➢ Heterogeneous Multicore Processor: Figure 3 notionally shows how these 14 applications
could be allocated to four different operating systems, which in turn are allocated to four
different cores, in a heterogeneous, quad-core processor. From left to right, the cores include a
general-purpose central processing unit core running Windows; a graphical processing unit
(GPU) core running graphics-intensive applications on Linux; a digital signal processing (DSP)
core running a real-time operating system (RTOS); and a high-performance core also running
an RTOS.

➢ Pros of Multicore Processing: By using multicore processors, architects can decrease the
number of embedded computers. By allocating applications to different cores, multicore
processing increases the intrinsic support for actual (as opposed to virtual) parallel processing
within individual software applications and across multiple applications. Multicore processing can
increase performance by running multiple applications concurrently. Allocating software to
multiple cores increases reliability and robustness (i.e., fault and failure tolerance) by limiting
fault and/or failure propagation from software on one core to software on another.
➢ Cons of Multicore Processing: Shared Resources. Cores on the same processor share both
processor-internal resources (L3 cache, system bus, memory controller, I/O controllers, and
interconnects) and processor-external resources (main memory, I/O devices, and networks).
These shared resources imply (1) the existence of single points of failure, (2) two applications
running on the same core can interfere with each other, and (3) software running on one core
can impact software running on another core (i.e., interference can violate spatial and temporal


isolation because multicore support for isolation is limited). The diagram below uses the color
red to illustrate six shared resources.

Figure 3. Heterogeneous Multicore Processor (source: CMU software engineering institute)

Concurrency Defects. Cores execute concurrently, creating the potential for concurrency
defects including deadlock, livelock, starvation, suspension, (data) race conditions, priority
inversion, order violations, and atomicity violations. Note that these are essentially the same
types of concurrency defects that can occur when software is allocated to multiple threads on a
single core.
Non-determinism. Multicore processing increases non-determinism. For example, I/O
Interrupts have top-level hardware priority (also a problem with single core processors).
Multicore processing is also subject to lock thrashing, which stems from excessive lock conflicts
due to simultaneous access of kernel services by different cores (resulting in decreased
concurrency and performance). The resulting non-deterministic behavior can be unpredictable,
can cause related faults and failures, and can make testing more difficult (e.g., running the same
test multiple times may not yield the same test result).
Analysis Difficulty. The real concurrency due to multicore processing requires different
memory consistency models than virtual interleaved concurrency. It also breaks traditional
analysis approaches for work on single core processors. The analysis of maximum time limits
is harder and may be overly conservative. Although interference analysis becomes more
complex as the number of cores-per-processor increases, overly-restricting the core number
may not provide adequate performance.


➢ Application of Multicore processors: Multicore processors work on any modern computer


hardware platform. Virtually all PCs and laptops today build in some multicore processor
model. However, the true power and benefit of these processors depend on software applications
designed to emphasize parallelism. A parallel approach divides application work into numerous
processing threads, and then distributes and manages those threads across two or more processor
cores. There are several major use cases for multicore processors, including the following five:
o Virtualization. A virtualization platform, such as VMware, is designed to abstract the
software environment from the underlying hardware. Virtualization is capable of
abstracting physical processor cores into virtual processors or central processing units
(vCPUs) which are then assigned to virtual machines (VMs). Each VM becomes a
virtual server capable of running its own OS and application. It is possible to assign
more than one vCPU to each VM, allowing each VM and its application to run parallel
processing software if desired.
o Databases. A database is a complex software platform that frequently needs to run many
simultaneous tasks such as queries. As a result, databases are highly dependent on
multicore processors to distribute and handle these many task threads. The use of
multiple processors in databases is often coupled with extremely high memory capacity
that can reach 1 terabyte or more on the physical server.
o Analytics and HPC. Big data analytics, such as machine learning, and high-
performance computing (HPC) both require breaking large, complex tasks into smaller
and more manageable pieces. Each piece of the computational effort can then be solved
by distributing each piece of the problem to a different processor. This approach enables
each processor to work in parallel to solve the overarching problem far faster and more
efficiently than with a single processor.
o Cloud. Organizations building a cloud will almost certainly adopt multicore processors
to support all the virtualization needed to accommodate the highly scalable and highly
transactional demands of cloud software platforms such as OpenStack. A set of servers
with multicore processors can allow the cloud to create and scale up more VM instances
on demand.
o Visualization. Graphics applications, such as games and data-rendering engines, have
the same parallelism requirements as other HPC applications. Visual rendering is math-
and task-intensive, and visualization applications can make extensive use of multiple
processors to distribute the calculations required. Many graphics applications rely on
graphics processing units (GPUs) rather than CPUs. GPUs are tailored to optimize
graphics-related tasks. GPU packages often contain multiple GPU cores, similar in
principle to multicore processors. (Source: TechTarget)


6. Shared Vs Distributed Memory:


In general, shared memory and distributed memory are low-level programming abstractions
that are used with certain types of parallel programming. Shared memory allows multiple
processing elements to share the same location in memory (that is to see each other’s reads and
writes) without any other special directives, while distributed memory requires explicit
commands to transfer data from one processing element to another.
There are two issues to consider regarding the terms shared memory and distributed memory.
One is what do these mean as programming abstractions, and the other is what do they mean in
terms of how the hardware is actually implemented.
In the past there were true shared memory cache-coherent multiprocessor systems. The systems
communicated with each other and with shared main memory over a shared bus. This meant
that any access from any processor to main memory would have equal latency. Today these
types of systems are not manufactured. Instead there are various point-to-point links between
processing elements and memory elements (this is the reason for non-uniform memory access,
or NUMA). However, the idea of communicating directly through memory remains a useful
programming abstraction. So in many systems this is handled by the hardware and the
programmer does not need to insert any special directives. Some common programming
techniques that use these abstractions are OpenMP and Pthreads.
Distributed memory has traditionally been associated with processors performing computation
on local memory and then, once that computation is done, using explicit messages to transfer data
to and from remote processors. This adds complexity for the programmer, but simplifies the hardware
implementation because the system no longer has to maintain the illusion that all memory is
actually shared. This type of programming has traditionally been used with supercomputers that
have hundreds or thousands of processing elements. A commonly used technique is MPI.
However, supercomputers are not the only systems with distributed memory. Another example
is GPGPU programming which is available for many desktop and laptop systems sold today.
Both CUDA and OpenCL require the programmer to explicitly manage sharing between the
CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because when
GPU programming started the GPU and CPU memory was separated by the PCI bus which has
a very long latency compared to performing computation on the locally attached memory. So
the programming models were developed assuming that the memory was separate (or
distributed) and communication between the two processing elements (CPU and GPU) required
explicit communication. Now that many systems have GPU and CPU elements on the same die
there are proposals to allow GPGPU programming to have an interface that is more like shared
memory. (ref: https://stackoverflow.com/questions/36642382/main-difference-between-shared-memory-and-distributed-
memory)
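The contrast can be made concrete with a short C sketch (illustrative only; the array name, its size, and the use of OpenMP here are assumptions, not part of the discussion above). Under the shared memory abstraction, every thread reads and writes the same array directly; under a distributed memory model such as MPI, each process would instead hold its own portion of the data and exchange elements through explicit send and receive calls.

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    int data[N];   /* a single array visible to every thread */

    /* Shared-memory style: threads write directly into the common array,
       with no explicit data transfers between them. */
    #pragma omp parallel for shared(data)
    for (int i = 0; i < N; i++)
        data[i] = i * i;

    for (int i = 0; i < N; i++)
        printf("data[%d] = %d\n", i, data[i]);

    return 0;
}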
➢ Shared Memory: Shared memory is memory which all the processors can access. From a
hardware point of view, it means all the processors have direct access to a common physical
memory through a bus-based (usually wired) interconnect. These processors can work


independently while they all access the same memory. Any change in the variables stored in the
memory is visible by all processors because at any given moment all they see is a copy or picture
of entire variables stored in the memory and they can directly address and access the same
logical memory locations regardless of where the physical memory actually exists. Figure 4 shows a shared memory example (ref: https://help.rc.ufl.edu/doc/Memory:_Shared_vs_Distributed).
Uniform Memory Access (UMA):
o Most commonly represented today by Symmetric Multiprocessor (SMP) machines
o Identical processors
o Equal access and access times to memory
o Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one
processor updates a location in shared memory, all the other processors know about the
update. Cache coherency is accomplished at the hardware level.

Figure.4 Shared memory example

o Non-Uniform Memory Access (NUMA):


▪ Often made by physically linking two or more SMPs
▪ One SMP can directly access memory of another SMP
▪ Not all processors have equal access time to all memories
▪ Memory access across link is slower
▪ If cache coherency is maintained, then may also be called CC-NUMA - Cache
Coherent NUMA

Figure.5 NUMA


➢ Distributed Memory: Distributed memory, in the hardware sense, refers to the case where the
processors can access another processor's memory only through a network. In the software sense, it
means each processor can directly see only its local memory and must use communication over the
network to access the memory of the other processors. Figure 6 illustrates
the distributed memory architecture.

Figure.6 Distributed Memory

7. Introduction to OpenMP Programming

OpenMP is a standard parallel programming API for shared memory environments, used with programs written in C, C++, or Fortran. It consists of a set of compiler directives with a "lightweight" syntax, library routines, and environment variables that influence run-time behavior. OpenMP is governed by the OpenMP Architecture Review Board (OpenMP ARB), and is defined jointly by several hardware and software vendors.

OpenMP behavior is directly dependent on the OpenMP implementation. Its capabilities enable the programmer to separate the program into serial and parallel regions rather than just running threads concurrently; it hides stack management and provides synchronization constructs. That being said, OpenMP will not guarantee speedup, parallelize code with dependencies, or prevent data races. Avoiding data races, keeping track of dependencies, and working towards a speedup are all up to the programmer.

➢ Use of OpenMP: OpenMP has received considerable attention in the past decade and is
considered by many to be an ideal solution for parallel programming because it has unique
advantages as a mainstream directive-based programming model.
First of all, OpenMP provides a cross-platform, cross-compiler solution. It supports lots of
platforms such as Linux, macOS, and Windows. Mainstream compilers including GCC,


LLVM/Clang, and the Intel Fortran and C/C++ compilers provide good OpenMP support. Also, with
the rapid development of OpenMP, many researchers and computer vendors are constantly
exploring how to optimize the execution efficiency of OpenMP programs and continue to
propose improvements for existing compilers or develop new compilers. What's more,
OpenMP is a standard specification, and all compilers that support it implement the same set of
standards, so there are no portability issues.
Secondly, OpenMP makes it convenient and flexible to modify the number of threads, which helps
solve the scalability problem posed by the growing number of CPU cores. In the multi-core era, the number
of threads needs to change according to the number of CPU cores, and OpenMP has irreplaceable
advantages in this regard.

Figure.7 OpenMP Solution Stack

Thirdly, using OpenMP to create threads is considered to be convenient and relatively easy
because it does not require an entry function, the code within the same function can be
decomposed into multiple threads for execution, and a for loop can be decomposed into multiple
threads for execution. If OpenMP is not used, when the operating system API creates a thread,
the code in a function needs to be manually disassembled into multiple thread entry functions.
To sum up, OpenMP has irreplaceable advantages in parallel programming. More and more
new directives are being added to achieve more functions, and they are playing an important
role on many different platforms. Figure 7 illustrates the OpenMP solution stack.
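A small sketch of how the number of threads can be changed at run time (illustrative code; the thread count of 4 is an arbitrary choice): the omp_set_num_threads() routine, or equivalently the OMP_NUM_THREADS environment variable, controls how many threads the next parallel region uses.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);   /* request 4 threads; OMP_NUM_THREADS=4 would do the same */

    #pragma omp parallel
    {
        /* Each thread reports its id and the size of the team it belongs to. */
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}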
➢ Installation of OpenMP
Installation Steps on Linux Systems
Install gcc compiler: sudo apt-get install build-essential

Install OpenMP library: sudo apt-get install libomp-dev
Installation Steps on Windows Systems – Windows not recommended

For more detailed OpenMP Installation with sample code follow the below link
Option1: https://www.geeksforgeeks.org/openmp-introduction-with-installation-
guide/
Option2: https://www.youtube.com/watch?v=5cVU4MKsvqU
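Once the compiler and library are installed, an OpenMP program is compiled with the -fopenmp flag (the file name hello.c below is only a placeholder):
Compile: gcc -fopenmp hello.c -o hello
Run: ./hello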

➢ Creating a Simple Parallel Loop

The first example below shows the basic OpenMP syntax: a parallel region in which every thread executes the enclosed statement and prints a message. A sketch of an actual parallel loop follows it. In a parallel loop construct, the loop iteration variable is private by default, so it is not necessary to specify it explicitly in a private clause.
//%compiler: clang
//%cflags: -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]){

    /* Each thread in the team created by the parallel region prints the message once. */
    #pragma omp parallel
    printf("%s\n", "Hello World");

    return 0;
}
Ref: https://passlab.github.io/OpenMPProgrammingBook/openmp_c/2_Syntax.html
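A minimal sketch of the parallel loop construct itself (an illustrative example written for these notes, not taken from the reference above; the array size and contents are arbitrary): the iterations of the for loop are divided among the threads of the team, and the iteration variable i is private to each thread by default.

//%compiler: clang
//%cflags: -fopenmp
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    double a[N], b[N], c[N];

    /* Initialize the input vectors serially. */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* The loop iterations are shared among the threads; i is private by default. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[%d] = %f\n", N - 1, c[N - 1]);
    return 0;
}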

Note: For more programming concepts the readers are instructed to go through the following
link. (https://www.openmp.org/wp-content/uploads/Intro_To_OpenMP_Mattson.pdf).
Assignments will be given during the lecture hours.
The following topics are for self-study, to create the programming environment from the material
provided in the above links:
➢ Getting Started with OpenMP:
o Introduction to parallel programming
o Hello world and how threads work
➢ The Core features of OpenMP
o Creating Threads (the Pi program)
o Parallel Loops (making the Pi program simple)
➢ Working with OpenMP
o Synchronize single masters and stuff

8. SIMD: Single instruction, multiple data (SIMD) is a form of parallel execution in which the
same operation is performed on multiple data elements independently in hardware vector
processing units (VPU), also called SIMD units. The addition of two vectors to form a third


vector is a SIMD operation. Many processors have SIMD (vector) units that can perform
simultaneously 2, 4, 8 or more executions of the same operation (by a single SIMD unit).
Loops without loop-carried backward dependency (or with dependency preserved using ordered
simd) are candidates for vectorization by the compiler for execution with SIMD units. In
addition, with state-of-the-art vectorization technology and declare simd construct extensions
for function vectorization in the OpenMP 4.5 specification, loops with function calls can be
vectorized as well. The basic idea is that a scalar function call in a loop can be replaced by a
vector version of the function, and the loop can be vectorized simultaneously by combining a
loop vectorization (simd directive on the loop) and a function vectorization (declare simd
directive on the function).
A simd construct states that SIMD operations be performed on the data within the loop. A
number of clauses are available to provide data-sharing attributes (private, linear, reduction and
lastprivate). Other clauses provide vector length preference/restrictions (simdlen / safelen), loop
fusion (collapse), and data alignment (aligned).
The declare simd directive designates that a vector version of the function should also be
constructed for execution within loops that contain the function and have a simd directive.
Clauses provide argument specifications (linear, uniform, and aligned), a requested vector
length (simdlen), and designate whether the function is always or never called conditionally in a
loop (inbranch / notinbranch). The latter is for optimizing performance.
Also, the simd construct has been combined with the worksharing loop constructs (for simd and
do simd) to enable simultaneous thread execution in different SIMD units. A sketch of such a combined construct with a reduction clause follows.
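A minimal sketch of the combined construct with a reduction clause (illustrative code, not taken from the specification; the array and its size are arbitrary choices): the iterations are divided among the threads, each thread's chunk is executed with SIMD instructions, and the partial sums are combined by the reduction clause.

#include <stdio.h>
#include <omp.h>

#define N 1024

int main(void)
{
    double x[N], sum = 0.0;

    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    /* Threads share the iterations, each chunk is vectorized, and the
       per-thread partial sums are combined into sum. */
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);   /* expected: 1024.000000 */
    return 0;
}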
➢ Simd and declare simd:
The following example illustrates the basic use of the simd construct to assure the compiler that the loop can be vectorized:

Example SIMD.1.c
void star( double *a, double *b, double *c, int n, int *ioff )
{
    int i;
    /* Each iteration is independent, so the loop can be executed with SIMD instructions. */
    #pragma omp simd
    for ( i = 0; i < n; i++ )
        a[i] *= b[i] * c[i + *ioff];
}
Ref: https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.0?topic=pdop-pragma-omp-simd
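A companion sketch for the declare simd directive (an illustrative example under the assumption of a simple element-wise function; add1 and add_all are hypothetical names, not from the reference above): the directive asks the compiler to also generate a vector version of the function, so the call inside the simd loop can itself be vectorized.

#pragma omp declare simd
double add1(double x)
{
    return x + 1.0;   /* scalar body; a vector version is also generated */
}

void add_all(double *a, int n)
{
    /* Inside the SIMD loop, the call to add1 is replaced by its vector version. */
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[i] = add1(a[i]);
}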

9. Vector Processing

Vector processor is basically a central processing unit that has the ability to execute the complete vector
input in a single instruction. More specifically we can say, it is a complete unit of hardware resources
that executes a sequential set of similar data items in the memory using a single instruction.


Unlike scalar processors that operate on only a single pair of data, a vector processor operates on
multiple pairs of data. However, one can convert scalar code into vector code; this conversion process
is known as vectorization. So, we can say vector processing allows operation on multiple data elements
with the help of a single instruction. These instructions are said to be single instruction multiple data (SIMD) or
vector instructions. CPUs used in recent times make use of vector processing, as it is more advantageous
than scalar processing. Let us now move further to understand how the vector processor functions.

Figure.8 Functional Architecture of Vector computer

The functional units of a vector computer are as follows:

o IPU or instruction processing unit


o Vector register
o Scalar register
o Scalar processor
o Vector instruction controller
o Vector access controller
o Vector processor


Let us now understand the overall operation performed by the vector computer.

As it has several functional pipes thus it can execute the instructions over the operands. We know that
both data and instructions are present in the memory at the desired memory location. So, the instruction
processing unit i.e., IPU fetches the instruction from the memory.

Once the instruction is fetched, the IPU determines whether it is scalar or vector in nature. If it is scalar,
the instruction is transferred to the scalar register and then further scalar processing is performed.

When the instruction is vector in nature, it is fed to the vector instruction controller. The vector
instruction controller first decodes the vector instruction and then determines the address of the
vector operand present in the memory.

Then it gives a signal to the vector access controller about the demand of the respective operand. This
vector access controller then fetches the desired operand from the memory. Once the operand is fetched
then it is provided to the instruction register so that it can be processed at the vector processor.

At times when multiple vector instructions are present, then the vector instruction controller provides
the multiple vector instructions to the task system. And in case the task system shows that the vector
task is very long then the processor divides the task into subvectors.

These subvectors are fed to the vector processor that makes use of several pipelines in order to execute
the instruction over the operand fetched from the memory at the same time.

The various vector instructions are scheduled by the vector instruction controller.

➢ Characteristic of vector processing

A vector is defined as an ordered set of a one-dimensional array of data items. A vector V of length n
can be represented as a row vector by V = [V1 V2 V3 · · · Vn]. If the data items are listed in a column,
it may be represented as a column vector. For a processor with multiple ALUs, it is possible to operate
on multiple data elements in parallel using a single instruction. Such instructions are called single-
instruction multiple-data (SIMD) instructions. They are also called vector instructions.

VectorAdd.S Vi, Vj, Vk

The above vector instruction computes element-wise sums, one for each of the vector-length elements in
vector registers Vj and Vk, and places the resulting sums in vector register Vi. Similar instructions are
used to perform other arithmetic operations.

VectorLoad.S Vi, X(Rj)


The above vector instruction is used to transfer multiple data elements between a vector register and the
memory. A computer capable of vector processing eliminates the overhead associated with the time it
takes to fetch and execute the instructions in a program loop. It allows operations to be specified with
a single vector instruction of the form

C(1: 100) = A(1: 100) + B(1: 100).

The vector instruction includes the initial address of the operands, the length of the vectors, and the
operation to be performed, all in one composite instruction.

The instruction format of the vector processor is:

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length

This is a three-address instruction, with three fields specifying the base addresses of the operands and an
additional field specifying the length of the vectors (the number of data items). This assumes that the vector operands
reside in memory. It is also possible to design the processor with a large number of registers and store
all operands in registers prior to the addition operation. In that case, the base address and length in the
vector instruction specify a group of CPU registers. In a source program written in a high-level
language, loops that operate on arrays of integers or floating-point numbers are vectorizable if the
operations performed in each pass are independent of the other passes.

A vectorizing compiler can recognize such loops and generate vector instruction if they are not too
complex. Using vector instructions reduces the number of instructions that need to be executed and
enables the operations to be performed in parallel on multiple ALUs. (ref:
https://www.codingninjas.com/studio/library/vector-processing)
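As a concrete illustration (a hand-written C equivalent of the statement C(1:100) = A(1:100) + B(1:100) above, not code from the reference): every iteration of the loop below is independent of the others, so a vectorizing compiler, or an explicit OpenMP simd directive, can turn it into vector instructions.

#define N 100

/* Element-wise addition: c[i] depends only on a[i] and b[i],
   so the loop is vectorizable. */
void vec_add(const double *a, const double *b, double *c)
{
    #pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}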

➢ Classification of vector processor:

The classification of vector processor relies on the ability of vector formation as well as the presence
of vector instruction for processing. So, depending on these criteria, vector processing is classified as
follows:


Figure.9 Classification of vector processor

o Register to Register Architecture: This architecture is highly used in vector computers. In this
architecture, operands and previous results are fetched indirectly from the main memory through
the use of vector registers. The several vector pipelines present in the vector
computer help in retrieving the data from the registers and also storing the results in the desired
register. These vector registers are user instruction programmable. This means that according
to the register address present in the instruction, the data is fetched and stored in the desired
register. These vector registers hold fixed length like the register length in a normal processing
unit. Some examples of a supercomputer using the register to register architecture are Cray – 1,
Fujitsu etc.

o Memory to Memory Architecture: Here, in memory to memory architecture, the operands or
the results are directly fetched from the memory instead of using registers. However, it is to be
noted here that the address of the desired data to be accessed must be present in the vector
instruction. This architecture enables the fetching of data of size 512 bits from memory to
pipeline. However, due to the high memory access time, the pipelines of the vector computer
require a higher startup time, as higher time is required to initiate the vector instruction. Some
examples of supercomputers that possess memory to memory architecture are the Cyber 205, CDC,
etc.

➢ Advantages of Vector Processor


o Vector processor uses vector instructions by which code density of the instructions can
be improved.
o The sequential arrangement of data helps the hardware to handle the data in a better way.
o It offers a reduction in instruction bandwidth.


o So, from the above discussion, we can conclude that register to register architecture is
better than memory to memory architecture because it offers a reduction in vector
access time. (ref: https://electronicsdesk.com/vector-processor.html)

➢ Graphical Processing Units (GPUs):

GPU stands for graphics processing unit. GPUs were originally designed specifically to accelerate
computer graphics workloads, particularly for 3D graphics. While they are still used for their original
purpose of accelerating graphics rendering, GPU parallel computing is now used in a wide range of
applications, including graphics and video rendering. GPU parallel computing is the ability to perform
several tasks at once. It enables GPUs to break complex problems into thousands
or millions of separate tasks and work them out all at once, instead of one by one as a CPU needs to.
(ref: https://people.duke.edu/~ccc14/sta-663/CUDAPython.html)

Figure.10 GPU processing flow

The GPU parallel computing ability is what makes GPUs so valuable. It is also what makes them
flexible and allows them to be used in a wide range of applications, including graphics and video
rendering.

An example of parallel processing is demonstrated by MythBusters:

https://www.youtube.com/watch?v=-P28LKWTzrI&t=93s


o GPU Architecture (CPU vs GPU)


A CPU is designed to handle complex tasks: time slicing, virtual machine emulation,
complex control flows and branching, security, etc. In contrast, GPUs only do one thing well:
handle billions of repetitive low-level tasks, originally the rendering of triangles in 3D
graphics, and they have thousands of ALUs compared with the CPU's 4 or 8. Many
scientific programs spend most of their time doing just what GPUs are good for, handling
billions of repetitive low-level tasks, and hence the field of GPU computing was born.
Originally, this was called GPGPU (General Purpose GPU programming), and it required
mapping scientific code to the matrix operations for manipulating triangles. This was
insanely difficult to do and took a lot of dedication. However, with the advent of CUDA
and OpenCL, high-level languages targeting the GPU, GPU programming is rapidly
becoming mainstream in the scientific community.

Figure.11 CPU vs GPU

Prepared By: Dr. Vishnu Kumar K, Professor/CSE Department, KPRIET, Coimbatore.
