
Parallel Programming Models and Paradigms: OpenMP Analysis

Arwa Alrawais
College of Computer Engineering and Sciences
Prince Sattam Bin Abdulaziz University
Al-Kharj 11942, Saudi Arabia.
Email: a.alrawais@psau.edu.sa

Abstract—The demand for processing power has grown over the years, and this demand has led to the parallel approach: linking a group of computers together to jointly increase both speed and efficiency. The parallel approach plays a significant role in new generations of applications by moving the technology from expensive, specialized parallel supercomputers to networks of ordinary computers. Over the years, the parallel approach has also given rise to parallel programming models, which exist above the hardware and memory architectures. A parallel programming model is a collection of software technologies that connects parallel algorithms and the corresponding applications with the underlying system. This paper describes the essential concepts of parallel programming and gives a brief overview of different parallel programming models and paradigms. Furthermore, it implements and evaluates OpenMP parallel programming and illustrates its effectiveness.

Keywords—Parallel programming models; Parallel programming paradigms; Parallel algorithms; Parallel applications

I. INTRODUCTION

In the 1980s, the best computer performance was thought to be achieved by making a single computer faster and adding more efficient processors. Over the years, the shift to parallel processing, in which a computational problem is essentially solved by several processors cooperating, changed how computer performance is achieved. In the early 1990s, the transition from the expensive, massively parallel processors found in supercomputers toward networks of computers grew sharply. As a consequence, high-performance computing resources such as PCs, networks, and workstations became available on the market. Building a cluster, or network of computers, also became attractive to many people because of the cost-effectiveness of this form of parallel processing. Alongside these hardware developments, there was progress in providing programming tools attuned to the diversity of computing environments. This development led to numerous parallel programming models that allow parallelism to be expressed and executed [1].

Parallel programming models can generally be assessed by the range of problems they can express and by how they can be executed on different architectures. When choosing a parallel programming model, an important consideration is how much parallelism can be exploited. This paper presents parallel programming models from various aspects. Furthermore, it explains the different phases of designing a parallel algorithm as well as the different parallel programming paradigms. Finally, some parallel programming implementations, such as OpenMP, are presented.

There is a common misconception among programmers that writing a parallel program is hard; because of this, most programmers prefer writing a sequential program to writing a parallel one. "The parallel programming model exists as abstraction above hardware and memory architecture [2]." It attempts to express the parallel program and to exploit parallelism to solve a problem. The models differ from one another: each can express a different range of problems, and each can run on different architectures. There are two main approaches to parallel programming (a brief sketch contrasting them follows this list):

• Implicit parallelism: the programmer does not specify the parallelism and therefore cannot control computation scheduling or data placement.

• Explicit parallelism: the parallelism is explicitly specified in the program code by the programmer using tools such as library calls or directives. This approach allows the user to assess how much parallelism can be exploited. The efficiency obtained with explicit parallelism is better than that obtained with implicit parallelism [1].
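As a rough illustration of the difference (this sketch is mine, not an example from the paper), the same loop is shown twice below: once as plain sequential code that an auto-parallelizing compiler may or may not parallelize implicitly, and once with an explicit OpenMP directive. The array names and size are arbitrary.

#include <cstdio>

int main() {
    enum { N = 1000000 };
    static double a[N], b[N];   // static storage: zero-initialized

    // Implicit parallelism: ordinary sequential code. A parallelizing
    // compiler may detect that the iterations are independent and run them
    // in parallel on its own; the programmer controls neither the
    // scheduling nor the data placement.
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    // Explicit parallelism: the programmer states the parallelism with a
    // directive, so the exploitable parallelism is visible and under the
    // programmer's control.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    std::printf("%f\n", a[0]);
    return 0;
}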
The rest of this paper is organized as follows. Section II summarizes related work on parallel models and paradigms. Section III introduces several programming models and tools. An implementation and evaluation of OpenMP is conducted in Section VII. Finally, a conclusion is drawn in Section VIII.

II. RELATED WORK

Over the past few decades, several parallel programming models and paradigms have been researched. In 1993, Giloi [3] organized the parallel programming domain into memory-sharing and message-passing paradigms. He indicated the advantages and disadvantages of each parallel programming model as well as its appropriate architecture. At that time, the attention of most researchers and developers turned toward parallel programming models, and as a result many conferences and workshops were held; for example, the Third Working Conference on Massively Parallel Programming Models and the Third International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS'98) were held in 1997 and 1998.

Consequently, the steady progress in the parallel programming research area over the years has led to the development and appearance of numerous parallel programming models and paradigms.

Over the years, many new parallel programming models have appeared. In 2009, MSI (Multi-thread Schedule Interface), a new parallel programming model, was proposed to exploit the power of multi-core processors and to overcome the load-balancing problem [4]. Furthermore, research on parallel programming models has attempted to optimize the level of parallelism. For instance, optimistic parallelism based on speculative asynchronous message passing, introduced in 2010, exploited more parallelism and reduced the runtime overhead of asynchrony by speculatively executing message passing in object-oriented programs [5]. On the other hand, the expansion of parallel programming models allows them to be applied in several domains, such as design patterns. The authors of [6] provided a design-pattern-based parallel programming model, DPBPPM, and a system implementation on an SMP platform; DPBPPM has been shown to be a flexible and efficient system. There is also an ongoing and appreciable high-performance computing research effort on advanced parallel programming models. For instance, the Center for Programming Models for Scalable Parallel Computing is a research center that concentrates on programming models for scalable parallel computing. The center is led by Argonne National Laboratory together with other universities, and it works on improving current models and developing advanced ones.

III. PARALLEL PROGRAMMING MODELS AND TOOLS

This section presents different models of parallel programming. It briefly describes the different parallel programming models and approaches, including parallelizing compilers, parallel languages, High Performance FORTRAN, message passing, virtual shared memory, data parallel, programming skeletons, and the Partitioned Global Address Space.

A. Parallelizing Compilers

A parallelizing compiler is a type of source-to-source compiler whose input and output are both in a high-level language. It transforms the program into a parallel output version after analyzing the sequential program and detecting which parts of the program can run in parallel. The output language is usually the same as the input language. The most common language used with parallelizing compilers is FORTRAN, for two significant reasons. First, many scientific computing algorithms are written in Fortran. Second, analyzing and transforming the program is easier because FORTRAN has properties, such as a static memory model, that make parallelizing procedures simple [7]. The parallelizing compiler works well for some applications on shared-memory multiprocessors. However, it has limitations for other applications that deal with distributed-memory machines, due to the non-uniform memory access times [1].
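To make the compiler's task concrete, the fragment below (an illustrative sketch of mine, not an example from the paper) shows a loop that a parallelizing compiler can safely transform next to one it generally cannot, because of a loop-carried dependence.

#include <cstdio>

int main() {
    const int n = 1000;
    double a[1000], b[1000];
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }

    // Independent iterations: each a[i] depends only on a[i] and b[i], so a
    // parallelizing compiler can distribute the iterations across
    // processors without changing the result.
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];

    // Loop-carried dependence: iteration i reads a[i-1], which iteration
    // i-1 wrote, so the iterations cannot simply run in parallel; the
    // compiler must leave this loop sequential (or restructure it).
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];

    std::printf("%f\n", a[n - 1]);
    return 0;
}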
B. Parallel Languages

Most people are not willing to use languages they are not familiar with; they would rather use their traditional languages, such as C and FORTRAN. Furthermore, the perception that using a parallel programming language is hard makes programmers prefer sequential languages over parallel ones.

C. High Performance FORTRAN

In 1993, the first High Performance FORTRAN version was published to enable effective use of the language on different parallel systems. High Performance FORTRAN is an extension of Fortran 90; it supports distributing the computation on a single array over multiple processors using the data parallel model. It performs well on applications that follow SIMD (single instruction stream, multiple data streams) and MIMD (multiple instruction streams, multiple data streams) architectures. On the other hand, High Performance Fortran has difficulties when it is used with applications that have an asynchronous structure.

D. Message Passing

Message passing comes with libraries designed to be a standard for distributed-memory systems; they perform the sending, receiving, and other message-passing operations. Message passing offers a natural synchronization among processors. One area where message passing is beneficial is debugging, because it does not allow memory to be overwritten, and even if that happens it can be detected more easily than with shared memory [8]. It works efficiently and naturally on distributed-memory systems. The most common message-passing systems are the Message Passing Interface and the Parallel Virtual Machine.

1) Message Passing Interface: In the Message Passing Interface (MPI), each process has access to its own local memory through its CPU, and processes communicate with each other by sending and receiving messages [9]. Furthermore, transferring data between these processes requires cooperative operations that match the sending side with its receiving side. MPI provides communication among concurrent processes, including point-to-point communication and collective communication. Moreover, MPI implementations cover parallel machines, shared-memory machines, distributed-memory multiprocessors, workstation clusters, and heterogeneous networks. MPI is used by developers and users when the parallel system relies on message passing because of its standardization, portability, availability, and performance; in addition, it is supported by many HPC platforms.
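The short program below sketches the cooperative send/receive matching described above; it is a minimal illustration of mine using the standard MPI C bindings, not code taken from the paper.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        // Point-to-point: rank 0 sends one integer to rank 1.
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // The matching receive on rank 1 completes the cooperative pair.
        int value = 0;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

A run such as mpirun -np 2 ./a.out is assumed; at least two processes are needed for the send to have a matching receiver.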
2) Parallel Virtual Machine: The Parallel Virtual Machine (PVM) was developed earlier, and PVM3 was completed in 1993 [10]. Essentially, it is software that allows a heterogeneous collection of UNIX and/or Windows computers to be used as a single distributed parallel machine. The PVM system consists of two main parts: the PVM3 daemon, which resides on each computer making up the virtual machine, and a library of PVM interface routines that includes a collection of tools to help coordinate tasks. The PVM model provides the user with a heterogeneous environment and the following features [10]:

• Explicit message-passing model: each task in the multitask set performs its own part of the computation.

• Heterogeneity support: PVM supports heterogeneity in networks, applications, and machines.

• Multiprocessor support: PVM attempts to exploit the underlying hardware by using native methods on multiprocessors.

• Process-based computation: the unit of parallelism in PVM is a task that alternates between computation and communication, and each task is identified by a unique integer across the whole system.

• User-configured host pool: the pool is a set that can include single-processor and multi-processor computers, and it can be adjusted (hosts added or deleted) during operation.

The PVM computing model is based on the tasks that make up the application, where each task performs a part of the computation. The functionality of these tasks varies: some of them need to run in parallel and others need to synchronize. Any task can be started or stopped, and tasks may be added or deleted, during execution.

E. Virtual Shared Memory

Virtual shared memory differs from the traditional shared-memory concept: the memory is physically distributed among the processors, but all the processors of the distributed-memory machine behave as if they shared a single memory. The objective of designing virtual shared memory is to reduce the communication overhead required for coherent access. Virtual shared memory can be implemented at any level of the computer hierarchy, as shown in Figure 1 [11].

Fig. 1: Computer Hierarchical Levels.

At the hardware level, virtual shared memory requires special machines such as DASH and DDM; the data unit is a fixed number of bytes, around 16-256 bytes, although the size is not dictated by the hardware alone. Virtual shared memory can also be implemented at the compiler level; here the data units can be as large as required, and in addition the compiler can maintain sets of data that are logically connected. Virtual shared memory is also provided at the operating system level, as shown in Figure 1; the data size is fixed as at the hardware level, but at this level it is larger. At the highest level, where the virtual shared memory is implemented in the system software, the programmer is given more flexibility than at the other levels; a data manager process is provided to observe and control operations.

F. Data Parallel

Most of the parallelism in the data parallel model comes from applying operations to structured data. On each structure a group of tasks operates, and each task operates on a distinct partition of the data structure. The data parallel model can be used on either SPMD multicomputers or SIMD computers. With SIMD synchronization, all the processors execute in lockstep fashion. The synchronization of the data parallel model can be performed at compile time. Using the data parallel model on a SIMD computer makes writing and debugging the code simple, because the explicit parallelism is handled by the flow control and the synchronization hardware. In general, a data parallel program makes it easy to visualize the program behavior and, additionally, it has a natural load balance [12]. Examples of data parallel applications are image processing, the N-body problem, and matrix operations.
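As a small data-parallel sketch (my own illustration, using OpenMP on a shared-memory machine rather than a SIMD computer), the loop nest below applies the same operation to every element of an image-like 2D array, with each thread owning a distinct slice of the data.

#include <vector>
#include <cstdio>

int main() {
    const int height = 1080, width = 1920;
    std::vector<float> image(height * width, 0.5f);

    // Data parallel: the same brightness scaling is applied to every pixel.
    // collapse(2) flattens the two loops so the pixels are partitioned
    // evenly among the threads; each thread owns a distinct block of data.
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            image[y * width + x] *= 1.2f;

    std::printf("first pixel = %f\n", image[0]);
    return 0;
}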
G. Programming Skeletons

A programming skeleton is a collection of high-level abstractions that supports most of the parallel paradigms. A programming paradigm keeps the same control structure while being able to solve different problems; it contains useful data and communication patterns and, additionally, an abstraction of the form of the program, i.e., the skeleton. A specific parallel programming paradigm corresponds to a skeleton, a single abstraction that encapsulates both the communication patterns and the control; the skeleton thus identifies the parallel programming paradigm. Basically, skeletons are implemented on top of shared memory or distributed memory, message passing, and object orientation. As a result, a skeleton can be considered a generic program that facilitates parallelism [1].
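The template below is a rough sketch of what a very small skeleton might look like in C++ (my own illustration): the control structure (split the index range, run the pieces concurrently, join) is fixed, while the work applied to each index is a parameter.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <future>
#include <vector>

// A minimal "map" skeleton: the parallel control structure is fixed here,
// and the user only supplies the per-element work as a callable.
template <typename Func>
void parallel_map(std::size_t n, std::size_t num_workers, Func work) {
    std::vector<std::future<void>> workers;
    std::size_t chunk = (n + num_workers - 1) / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        std::size_t begin = w * chunk;
        std::size_t end = std::min(n, begin + chunk);
        workers.push_back(std::async(std::launch::async, [=] {
            for (std::size_t i = begin; i < end; ++i) work(i);
        }));
    }
    for (auto& f : workers) f.get();   // join: wait for all chunks
}

int main() {
    std::vector<double> v(1000000, 1.0);
    // The same skeleton solves different problems by changing the lambda.
    parallel_map(v.size(), 4, [&](std::size_t i) { v[i] *= 3.0; });
    std::printf("%f\n", v[42]);
    return 0;
}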
H. Partitioned Global Address Space

In the Partitioned Global Address Space (PGAS) model, the processors share a common address space that is obtained by partitioning the global addresses among them. PGAS is described as an easy model to program and understand, and it is considered to take the best from both the distributed-memory and the shared-memory models. "The PGAS is the best of both worlds. This parallel programming model combined the performance and data locality (partitioning) features of distributed memory with the programmability and data referencing simplicity of a shared-memory (global address space) model [13]". The PGAS programming model seeks to achieve these characteristics by providing [13]:

• A local-view programming style (which distinguishes between local and remote data partitions).

• A compiler that introduces the communication needed to resolve remote references.

• An address space that is globally usable and can be accessed directly by any process.

• Improved inter-process performance through one-sided communication.

• Support for a variety of distributed data structures.

Each process has both private and shared memory. For instance, if a process has local data, the private memory is used for it; similarly, if the process has global data, the shared memory is used for it. With a single address, a process can directly access any global data. Three PGAS programming languages are widely used: Co-Array FORTRAN (CAF), Unified Parallel C (UPC), and Titanium [13].

IV. PARALLEL ALGORITHM DESIGN

There are various ways of designing and building a parallel program. The design methodology illustrated here was proposed by Ian Foster; it lets the programmer concentrate on machine-independent issues, such as concurrency, in the early stages, while machine-specific aspects of the design are deferred until the end of the design process [1]. The methodology consists of four main stages: partitioning, communication, agglomeration, and mapping. The partitioning and communication stages aim at developing scalable and concurrent algorithms, while the agglomeration and mapping stages concentrate more on locality and other performance-related issues, as shown in Figure 2.

Fig. 2: Parallel Algorithm Design Stages.

A. Partitioning

The computation and the data are decomposed into small tasks. Domain (data) decomposition addresses the decomposition of the data, while functional decomposition addresses the decomposition of the computation into tasks. At the partitioning step, the number of processors of the target computer is ignored.

B. Communication

Communication is needed to coordinate the tasks obtained from the partitioning phase, and the communication pattern is determined in this phase. Four kinds of communication patterns can be distinguished: static or dynamic, structured or unstructured, local or global, and synchronous or asynchronous [1], [14].

C. Agglomeration

The communication structure and the tasks produced during the previous phases are evaluated. To improve performance or reduce cost, tasks are combined into larger groups.

D. Mapping

Mapping assigns each task to a processor in order to maximize processor utilization and minimize communication cost. Mapping can be performed at compile time (statically) or at run time (dynamically) using load-balancing methods.

V. PARALLEL PROGRAMMING PARADIGMS

When different parallel applications are classified into programming paradigms, a few paradigms turn out to be used frequently in more than one application. The parallel programming paradigms include different algorithms depending on the paradigm that is used, and selecting the appropriate paradigm for developing a parallel application depends on the algorithms of the different paradigms. Furthermore, the availability of resources for parallel computing, the variety of parallel computers, and the kind of parallelism inherent in the problem all help to determine which paradigm is required.

A. Choice of Paradigms

The authors who have classified parallel programs into different classes did not all follow precisely the same approach. Nevertheless, the following paradigms are the most popular ones used in parallel programming: Task Farming, Single Program Multiple Data, Divide and Conquer, Data Pipelining, Hybrid Models, and Speculative Parallelism [15].

B. Task Farming

The Task Farming paradigm involves two kinds of processes. First, the master process handles the decomposition of the problem into several tasks and then distributes these tasks to a farm of slave processes; afterwards, when the slaves have finished, the master assembles the partial results in order to compute the final result. Second, the multiple slave processes execute a small cycle: getting a message with a task, processing the task, and sending the result back to the master. Communication in Task Farming usually occurs only between the master process and the slave processes [15].

This paradigm can be used with either static or dynamic load balancing. With static load balancing, the distribution of tasks happens entirely in the first stage; as a result of this up-front distribution, the master can also contribute to the computation after assigning work to the slaves. Dynamic load balancing is more appropriate when the number of tasks is larger than the number of processors. Dynamic load balancing has the significant feature that the application can adapt to changes in the system and to the resources available. In general, the Task Farming paradigm can deliver high scalability and a high rate of computation [1].
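The sketch below shows one common way such a master/worker farm can be organized with MPI (my own illustration, assuming a dynamic, on-demand hand-out of tasks; it is not code from the paper).

#include <mpi.h>
#include <cstdio>

// Task farming sketch: rank 0 is the master; all other ranks are workers
// that repeatedly request work, process it, and return a partial result.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_tasks = 100;

    if (rank == 0) {                       // master
        int next_task = 0, total = 0;
        int active = size - 1;
        while (active > 0) {
            int result;
            MPI_Status st;
            // Any worker reporting a result (or asking for its first task).
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            total += result;
            // Hand out the next task, or -1 to tell the worker to stop.
            int task = (next_task < num_tasks) ? next_task++ : -1;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
            if (task == -1) active--;
        }
        std::printf("sum of task results = %d\n", total);
    } else {                               // worker (slave)
        int result = 0;                    // dummy "request" for the first task
        while (true) {
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            int task;
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task == -1) break;
            result = task * task;          // "process" the task
        }
    }

    MPI_Finalize();
    return 0;
}

A run with several processes is assumed (for example mpirun -np 4); rank 0 only coordinates here, which corresponds to the dynamic form of the load-balancing discussion above.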
C. Single Program Multiple Data

The Single Program Multiple Data (SPMD) paradigm divides the data among the processors, and each processor runs the same basic algorithm on its own part of the data; the partial results are gathered at the end. This kind of parallelism is also called geometric parallelism, domain decomposition, or data parallelism. SPMD is the most widespread paradigm in use. It performs efficiently if the data is distributed equally among the processes and the system is homogeneous [1], [15].

D. Data Pipelining

Data pipelining is one of the most common decomposition paradigms, owing to its simplicity and robustness. It identifies the parallel tasks in the algorithm, and each processor runs one part of the algorithm. The data flows through the pipeline from one processor, which corresponds to one stage of the pipeline, to another processor, until the last processor produces the output. Each process executes a fraction of the algorithm, and the processes communicate with each other through the flow of data. The data pipeline paradigm is usually used in image processing and data reduction applications [1], [15].

E. Divide and Conquer

Divide and conquer is a top-down approach. The solution of the main problem, which sits at the top level, is obtained by combining the solutions of the sub-problems at the lower levels. The approach divides the main problem into simpler sub-problems and sometimes decomposes these sub-problems into even simpler ones where possible, until each sub-problem has a simple solution; it then gathers all the sub-problems' results to obtain the solution of the main problem. Given sufficient parallelism, the sub-problems can be solved concurrently.

The structure of a divide-and-conquer algorithm can be viewed as a tree in which the main problem is the root and the sub-problems are the nodes. The splitting and combining of the sub-problems expose some parallelism, while the processors require only little communication: no communication is needed between the sub-processes, because the sub-problems are independent [1]. Basically, divide and conquer is performed in three main steps: compute, split, and join.
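As an illustrative sketch (not from the paper), OpenMP tasks are one convenient way to express the tree of independent sub-problems; here a recursive sum splits its range, computes the halves as concurrent tasks, and joins the partial results.

#include <cstdio>
#include <vector>

// Divide and conquer with OpenMP tasks: split the range, compute the two
// halves as independent tasks, then join their partial sums.
static long long parallel_sum(const int* data, std::size_t lo, std::size_t hi) {
    if (hi - lo < 10000) {                 // base case: compute directly
        long long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += data[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;  // split
    long long left = 0, right = 0;
    #pragma omp task shared(left)
    left = parallel_sum(data, lo, mid);
    #pragma omp task shared(right)
    right = parallel_sum(data, mid, hi);
    #pragma omp taskwait                   // join the sub-results
    return left + right;
}

int main() {
    std::vector<int> v(1u << 20, 1);
    long long total = 0;
    #pragma omp parallel
    #pragma omp single
    total = parallel_sum(v.data(), 0, v.size());
    std::printf("total = %lld\n", total);  // expected: 1048576
    return 0;
}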
F. Speculative Parallelism

Speculative parallelism is used when parallelism is hard to obtain with any of the previous paradigms; some problems cannot achieve parallelism in those ways because of complex data dependencies. Speculative parallelism uses optimistic execution to simplify the parallelism and to execute the problem in small fractions. Some cases where speculative parallelism is used are discrete-event simulation (an asynchronous problem) and running different algorithms for the same problem, where the first one to finish provides the final solution [1], [15].

G. Hybrid Models

A hybrid model, also called mixed-mode programming, consists of more than one paradigm; it is used when a mix of elements from different paradigms is required. An example of a hybrid model is OpenMP with MPI, where message passing with MPI is used for communication between nodes and OpenMP is used to control the threads within each single node.
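The following is a minimal sketch of such an MPI+OpenMP combination (my own illustration): MPI splits the work among processes, and OpenMP threads work on each process's share.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Ask MPI for thread support so OpenMP threads may coexist with MPI calls.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long n = 8000000;
    long begin = rank * (n / size);             // MPI: each process owns a block
    long end = (rank == size - 1) ? n : begin + n / size;

    double local = 0.0;
    #pragma omp parallel for reduction(+:local) // OpenMP: threads within a process
    for (long i = begin; i < end; i++)
        local += 1.0 / (1.0 + (double)i);

    double global = 0.0;                        // MPI: combine the per-process sums
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}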
VI. PARALLEL PROGRAMMING

There are many existing parallel programming systems. This section discusses and compares two widely known ones, OpenMP and Threading Building Blocks, from several aspects.

A. OpenMP

OpenMP, which stands for Open Multi-Processing, was developed as a cooperative effort of several software developers and manufacturing companies, such as Oracle, IBM, Intel, and Hewlett-Packard. It is an Application Program Interface (API) that supports multi-platform shared-memory parallel programming in C, C++, and FORTRAN [16]. OpenMP is available for most architectures supporting these languages, including UNIX platforms and Windows NT platforms. OpenMP gives shared-memory parallel programmers a simple interface for developing parallel applications, thanks to its portability and scalability. Portability is a significant characteristic of OpenMP: any compiler that supports OpenMP can be used for any parallel application developed with OpenMP. To attain parallel performance, the compiled binary must then be executed on the target hardware platform [17].

The following OpenMP example is given in C++ and FORTRAN in [17]. In the example, a loop adds the Y array to the X array, and the iterations are executed in parallel.
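The original listing appears in [17] and is not reproduced here; the fragment below is a hedged reconstruction of what such an array-addition loop typically looks like with OpenMP (the array names X and Y follow the description; the size is arbitrary).

#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // The iterations are independent, so the parallel for directive lets
    // OpenMP split the loop across the available threads.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = x[i] + y[i];            // add Y to X element-wise

    std::printf("x[0] = %f\n", x[0]);  // expected: 3.0
    return 0;
}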
Another OpenMP example

This example demonstrates the use of a WordCount program with OpenMP. WordCount is a program that counts the number of occurrences of each unique word in every file it reads. After the program has read the files, it reports the total number of unique words and the most frequent word in the files.

Within each file, the lines can be split among several processors, which provides sufficient parallelism to speed up the word counting; the processes' results are then combined to deliver the final result. The input to the program is a file that can be quite large, which is why the file is parsed in parallel. To do this, the program needs to create objects of the following classes:

• Parser obj: used to parse multiple lines of the file into strings.

• Collect obj: used to collect and combine the results produced by the Parser objects.

The WordCount program is used with OpenMP as shown in Figure 4. OpenMP is used to execute the while loop in parallel, which accomplishes the parse phase; the collect phase is then carried out using the parallel for directive. OpenMP features are used to count the number of occurrences of each distinct word and the total number of words [17].

Fig. 4: OpenMP WordCount program [17].
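The actual listing appears as Figure 4 in [17] and is not reproduced here; the fragment below is only a self-contained sketch of the parse-then-collect structure described above (my own illustration; it does not reproduce the Parser obj and Collect obj classes).

#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Stand-in for the lines read from the input files.
    std::vector<std::string> lines = {"the quick brown fox", "the lazy dog",
                                      "the fox"};

    std::map<std::string, long> counts;

    #pragma omp parallel
    {
        std::map<std::string, long> local;          // per-thread partial result

        // Parse phase: the lines are split among the threads.
        #pragma omp for nowait
        for (long i = 0; i < (long)lines.size(); i++) {
            std::istringstream in(lines[i]);
            std::string word;
            while (in >> word) local[word]++;
        }

        // Collect phase: merge each thread's partial counts into the total.
        #pragma omp critical
        for (const auto& kv : local) counts[kv.first] += kv.second;
    }

    std::printf("unique words: %zu\n", counts.size());
    for (const auto& kv : counts)
        std::printf("%s: %ld\n", kv.first.c_str(), kv.second);
    return 0;
}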
OpenMP exposes to the user the parallelism that is marked in loops and other parts of the code. OpenMP has the advantage that it is easy to move from a sequential program to a parallel OpenMP program. OpenMP scalability is restricted by the number of processor cores and by the complexity of the computational logic. In general, OpenMP provides a standard across a diversity of shared-memory architectures and defines a set of directives for programming shared-memory machines. In addition, OpenMP is easy to use, since it provides the ability to incrementally parallelize a serial program and the ability to implement both fine-grain and coarse-grain parallelism [16]. It supports the languages C, C++, and FORTRAN (77, 90, and 95). Finally, OpenMP 3.0 is a newer version, released in 2008, that makes the parallel programming model more flexible by providing dynamic tasking.

B. Threading Building Blocks

Threading Building Blocks (TBB) offers an approach for expressing parallelism in C++ programs. It is a C++ template library that includes implementations of several parallel algorithms, such as parallel_for, parallel_while, parallel_reduce, and more. Furthermore, TBB "reduces the developer burden of managing complex and error-prone parallel data structures" by offering a set of parallel data structures, such as concurrent_queue, concurrent_vector, and concurrent_hash_map [17]. The central execution unit in TBB is scheduled by the library's run-time engine.

An example of Threading Building Blocks is the WordCount program illustrated in Figure 3. The previous classes (Parser obj and Collect obj) are used again in this example with TBB, and the example exploits numerous TBB features. Using parallel_for, each thread operates on a chunk of the subdivided iteration space, and a blocked_range abstracts the one-dimensional space. The ApplyParser class uses parallel_for by being invoked from the main program. TBB also allows the appropriate reduction operation to be performed in parallel: another class, ReduceCollect, exhibits the characteristics of a reduction operation, its operator() defines the iterative operation over each element in the block, and a reduction object is required [17].

Fig. 3: TBB WordCount program [17].
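The listing in Figure 3 comes from [17] and is not reproduced here; as a self-contained sketch of the same ingredients (blocked_range, and parallel_reduce with a body class whose operator() processes a sub-range and whose join() merges partial results), the fragment below counts words across a set of lines with TBB. Header names are assumed to follow the current oneTBB layout.

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstdio>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Body class for parallel_reduce: operator() parses a sub-range of lines,
// join() merges the partial word counts (the reduction step).
struct CountWords {
    const std::vector<std::string>& lines;
    std::map<std::string, long> counts;

    explicit CountWords(const std::vector<std::string>& l) : lines(l) {}
    CountWords(CountWords& other, tbb::split) : lines(other.lines) {}

    void operator()(const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            std::istringstream in(lines[i]);
            std::string word;
            while (in >> word) counts[word]++;
        }
    }
    void join(CountWords& rhs) {
        for (const auto& kv : rhs.counts) counts[kv.first] += kv.second;
    }
};

int main() {
    std::vector<std::string> lines = {"the quick brown fox", "the lazy dog",
                                      "the fox"};
    CountWords body(lines);
    tbb::parallel_reduce(tbb::blocked_range<size_t>(0, lines.size()), body);
    std::printf("unique words: %zu\n", body.counts.size());
    return 0;
}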
According to [8], in 2011 Open CASCADE adopted Intel TBB. TBB was selected for Open CASCADE because it provides a scalable memory allocator, parallel algorithms, and simplicity of integration. TBB is one of the best shared-memory parallel programming approaches built on a run-time library. It aims to reduce the effort a programmer needs to write a parallel program, and it shields the developer from the low-level details of thread management and from common programming errors. In contrast, comparing TBB with OpenMP, OpenMP is easier for converting a sequential program into a parallel program, while creating a program is still more complicated in TBB [17].

VII. IMPLEMENTATION AND EVALUATION

In order to evaluate OpenMP parallelism, we use the matrix multiplication scenario, as it offers a very favorable case for parallelization. The matrix multiplication algorithm computes the product of two input matrices. The algorithm is organized as three main loop nests: two of them initialize the input matrices and the other one performs the multiplication. The matrix multiplication algorithm was tested on a dual-core Intel Core(R) N4000 CPU at 1.10 GHz with 4 GB of RAM. We evaluate the OpenMP parallelism model in terms of speedup and efficiency.
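The paper does not restate these metrics; the standard definitions, which we assume here, are

    S(p) = T_1 / T_p,        E(p) = S(p) / p = T_1 / (p * T_p),

where T_1 is the execution time with a single thread, T_p the execution time with p threads, S(p) the speedup, and E(p) the efficiency.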
As shown in Figure 5, the execution time decreases as the number of threads grows. Similarly, the efficiency of the model rises with the increasing number of threads, as shown in Figure 6.

Fig. 5: OpenMP Speedup.

Fig. 6: OpenMP Efficiency.
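The benchmark code itself is not listed in the paper; the following is a minimal sketch of an OpenMP matrix multiplication with timing of the kind described above (the matrix size is an arbitrary choice).

#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n = 512;
    std::vector<double> a(n * n), b(n * n), c(n * n, 0.0);

    // Two initialization loop nests.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) a[i * n + j] = i + j;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) b[i * n + j] = (i == j) ? 1.0 : 0.0;

    // Multiplication loop nest: rows of C are distributed among the threads.
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    double elapsed = omp_get_wtime() - start;

    // Timing this region with 1, 2, ... threads gives the data needed for
    // the speedup and efficiency plots.
    std::printf("threads=%d time=%.3f s c[0]=%.1f\n",
                omp_get_max_threads(), elapsed, c[0]);
    return 0;
}

Compiled with OpenMP enabled (for example with -fopenmp) and run with OMP_NUM_THREADS set to 1, 2, 4, and so on, the printed times provide the T_p values for the speedup and efficiency figures.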
The closest model to OpenMP is TBB, so a comparison between the two models has been made. To compare OpenMP and TBB based on criteria of interest, we summarize the most significant differences between the two models. As illustrated in Table I, the offloading feature refers to running parallel work across both the host and a device, which mainly supports accelerator-based systems; OpenMP supports offloading between the host and a target device, while TBB only supports execution on the host side. Another feature is static scheduling, where the execution order of threads can be controlled; this is supported by OpenMP but missing in TBB. OpenMP does not support nested and complex parallel patterns as well as TBB does. In comparison, OpenMP provides memory hierarchy constructs that let programmers specify memory locations. Mutual exclusion for protecting data access in a parallel program is supported by OpenMP, while TBB provides similar data-access locks called mutexes. Finally, TBB does not require any special language or compiler support, unlike OpenMP, which uses program directives interpreted by the compiler [18], [19].

TABLE I: Comparison between OpenMP and TBB.

Feature                          OpenMP   TBB
Offloading                         +       -
Static scheduling                  +       -
Parallelism complexity             -       +
Nested parallelism support         -       +
Abstraction of memory hierarchy    +       -
Mutual exclusion                   +       -
Compiler support not required      -       +

VIII. CONCLUSIONS

In summary, by using different parallel programming models, programmers are able to express the parallelism in their programs while concurrently exploiting the capabilities of the underlying hardware architecture. Each parallel programming model has a different way of exploiting parallelism. Currently, the parallel programming models research area shows advanced development from time to time, and we believe that within a few years parallel programming performance will be ten times what it is today, including both current and new models. This paper briefly describes diverse parallel programming models, including parallelizing compilers, parallel languages, High Performance FORTRAN, message passing, virtual shared memory, data parallel, programming skeletons, and the Partitioned Global Address Space. Furthermore, it explains the most common paradigms: Task Farming, Single Program Multiple Data, Divide and Conquer, Data Pipelining, Hybrid Models, and Speculative Parallelism. Finally, it discusses parallel algorithm design and illustrates some examples of parallel programming models.

REFERENCES

[1] R. Buyya et al., "High performance cluster computing: Architectures and systems (volume 1)," Prentice Hall, Upper Saddle River, NJ, USA, vol. 1, no. 999, p. 29, 1999.
[2] B. Barney et al., "Introduction to parallel computing," Lawrence Livermore National Laboratory, vol. 6, no. 13, p. 10, 2010.
[3] W. Giloi, "Parallel programming models and their interdependence with parallel architectures," in Proceedings of Workshop on Programming Models for Massively Parallel Computers. IEEE, 1993, pp. 2–11.
[4] J. Peng, C. Hu, and J. Xi, "MSI: a new parallel programming model," in 2009 WRI World Congress on Software Engineering, vol. 1. IEEE, 2009, pp. 56–60.
[5] Y. Du, Y. Zhao, B. Han, and Y. Li, "Optimistic parallelism based on speculative asynchronous messages passing," in International Symposium on Parallel and Distributed Processing with Applications. IEEE, 2010, pp. 382–391.
[6] H. Wu, "Design-pattern based parallel programming model and system implementation," in 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing. IEEE, 2008, pp. 1–5.
[7] F. Plavec, "Dependence testing for parallelizing compilers," 2003.
[8] Intel, "oneAPI Threading Building Blocks," https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onetbb.html.

[9] M. H. P. C. C. SP Parallel Programming Workshop, "Message passing interface," http://www.mhpcc.edu/training/workshop20/mpi/MAIN.html.
[10] S. H. Roosta, Parallel Processing and Parallel Algorithms: Theory and Computation. Springer Science & Business Media, 2012.
[11] A. Chalmers, E. Reinhard, and T. Davis, Practical Parallel Rendering. CRC Press, 2002.
[12] M. Parmar, "Data parallel model and object oriented model," http://www.gazhoo.com/doc/201006110254229749/DATA-PARALLEL+MODEL, 2009.
[13] R. Galal, "Partitioned global address space (PGAS)," https://www.mohamedfahmed.wordpress.com/2010/05/06/partitioned-globaladdress-space-pgas/, 2010.
[14] I. Foster, "Designing parallel algorithms," https://www.mcs.anl.gov/~itf/dbpp/text/node14.html, 1995.
[15] K. P. Kenneth Pedersen, "Scientific applications in distributed systems," http://www.idi.ntnu.no/grupper/su/sif8094-reports/2001/p7.pdf, 2001.
[16] "The OpenMP API specification for parallel programming," https://www.openmp.org/.
[17] Oracle, "Developing parallel programs - a discussion of popular models," https://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/oss-parallel-programs-170709.pdf, 2016.
[18] E. Ajkunic, H. Fatkic, E. Omerovic, K. Talic, and N. Nosovic, "A comparison of five parallel programming models for C++," in 2012 Proceedings of the 35th International Convention MIPRO. IEEE, 2012, pp. 1780–1784.
[19] S. Salehian, J. Liu, and Y. Yan, "Comparison of threading programming models," in 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2017, pp. 766–774.
