All content following this page was uploaded by Abu Asaduzzaman on 05 January 2016.
Abstract—The advancement of multicore systems demands applications with more threads. To facilitate this demand, parallel programming models such as the message passing interface (MPI) have been developed. By using such models, the execution time and the power consumption can be reduced significantly. However, the performance of MPI programming depends on the total number of threads and the number of processing cores in the system. In this work, we experimentally study the impact of Open MPI and POSIX Thread (Pthread) implementations on the performance and power consumption of multicore systems. A data dependent application (heat conduction on a 2D surface) and a data independent application (matrix multiplication) are run on high performance hardware in the experiments. The results suggest that both implementations, with more threads running in a system with more cores, have the potential to reduce the execution time with negligible or little increase in total power consumption. It is also observed that the performance of the MPI implementation varies due to the dynamic communication overhead among the processing cores.

Keywords—data dependency; message passing interface; multicore architecture; Open MPI; Pthread

I. INTRODUCTION

Multicore architecture helps improve the performance/power ratio by dividing a large job into smaller tasks and executing them concurrently on multiple cores at a lower frequency. However, providing more cores does not directly translate into performance [1, 2]. Multithreaded parallel programming has the ability to execute a computation in less time and with less power consumption. Therefore, parallel programming using the message passing interface can be a good choice to increase performance [3].

Most single-core and multicore processors support multithreading. POSIX Threads (usually referred to as Pthreads) is a POSIX standard for threads. For more than a decade, MPI has been the choice for high performance computing, and it has proven its capability in delivering higher performance in parallel applications [4]. The MPI standard and middleware help ensure that MPI programs continue to perform well on parallel and cluster architectures across a wide variety of network fabrics. Almost all MPI implementations bind an MPI process to an operating system (OS) process, where normally a "one process" per "processor core" mapping is used [5]. As a result, the notion of an MPI process is tightly bound to the physical resources of the machine, in particular the number of cores and the number of OS processes that can be created.

Ideally, adding more MPI processes to the simulation would bring about a linear decrease in execution time and power consumption. However, the benefits of adding more processes are reduced by the overhead associated with the additional processes: the communication delay and the dependencies among the processes [5, 6].

In this paper, we experiment with two programming techniques, Pthread and Open MPI, using a data dependent program (heat transfer) and a data independent computation (matrix multiplication), and analyze their impact on execution time and consumed power.

This paper is organized as follows. In Section 2, we review the background. Section 3 briefly describes the problems. Section 4 presents the experimental setup. Section 5 discusses the results in terms of execution time and total power consumption. Finally, Section 6 concludes the paper.

II. BACKGROUND STUDY

Parallel programming is becoming more and more popular because of its potential to improve performance, especially on multicore architectures. In the past, computer software was written using serial computation concepts, which is usually less efficient than multithreaded parallel computation.

A thread (i.e., a thread of execution) is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler. A thread is a light-weight process. Most modern processors support multiple threads and simultaneous multithreading. On a single-processor system, multithreading generally occurs by time-division multiplexing. On multicore systems, the threads actually run at the same time, with each processor or core running a particular thread [7].

In parallel computing, large problems are often divided into smaller ones, which are then solved concurrently (i.e., in parallel). Parallel computing is a rapidly-growing field, but many problems do not currently scale well onto multiple processors. In many cases, much of the benefit of parallelism is lost because of inefficiencies in the parallel program structure or a failure to hide the latency of message passing [3].

MPI is a message-passing application programmer interface: this language-independent communications protocol is used to program parallel computers for high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today [4, 8, 9]. However,
MPI-based programs decrease in performance as the number of processes running in parallel increases [5]. While this may appear to be of little consequence for small applications, many large applications can be significantly affected. Therefore, it is important to study the behavior of multiple threads versus the number of processing cores.
III. PROBLEMS CONSIDERED
Different applications scale differently on multicore systems. In this work we consider matrix multiplication (a data independent problem) and heat transfer on a 2D surface (a data dependent problem).
A. Data Independent Problem
Matrix multiplication is one of the most common numerical operations for data independent computation. A data independent computation is one in which a program statement does not refer to the data of a preceding statement. Figure 1 shows how a 4x3 matrix C (4 rows, 3 columns) is obtained by multiplying a 4x2 matrix A by a 2x3 matrix B. Here, element c1,2 is computed as (a1,1*b1,2 + a1,2*b2,2); c3,3 as (a3,1*b1,3 + a3,2*b2,3); and so on.
Fig. 2. Major steps in matrix multiplication.
involves dividing the grid into sub-grids. Each sub-grid must in turn exchange information with its adjacent sub-grids, propagating data in a cascading manner. It is therefore difficult to split up the iteration space so that reasonable sections can be calculated in parallel.

By distributing tasks/threads among the cores, the overall execution time is reduced from O(n*m*I) to O(n*m*c*I/p), where n and m are the dimensions of the grid, p is the number of processes, I is the number of iterations, and c is the communication latency. Ideally, more MPI processes help decrease the execution time and also the power consumption. In practice, however, the dependencies among the segments, the overhead associated with the additional processes, and the communication among them reduce these gains.

IV. EXPERIMENTAL SETUP

As overall system performance is influenced by the experimental platform, we briefly discuss the workloads, software, and hardware used in this experiment.

TABLE I. PLATFORM PARAMETERS

                  Workstation                Supercomputer
CPU               2x Quad-Core Intel         4x Octa-Core AMD
                  Xeon E5506                 Opteron 6134
Speed             2.13 GHz                   2.3 GHz per node
RAM               3 GB DDR3                  64 GB DDR3
                  (3x1GB 1333MHz)            (16x4GB 1333MHz)
Mainboard         Dell 0CRH6C                Dell 06JC9T
Chipset           Intel X58 I/O + ICH10R     AMD SR5650
Operating System  Linux (Debian)             Linux (RedHat)
Power             150W (idle), 254W (peak)   230W (idle), 328W (peak)
GPU Card          NVIDIA Tesla C2075         NONE
A. Workloads

For the data independent computation, we use 1024 x 1024 matrices for the matrix multiplication in both the Pthread/C and MPI implementations.

For the data dependent computation, we perform computations related to heat transfer in a room. In our experiment, we consider the following parameters: the total number of iterations is 10,000 and the dimension of the room is 640 x 480 square feet.

B. Software

The most commonly-used language for the different parallel programming models is the high-level language C. It is very portable: compilers exist for virtually every processor. Considering this popularity, we embed the MPI parallel programming model in serial C programs.

MPI uses a message passing technique to send and receive data. Message passing involves the transfer of data from one process (send) to another process (receive) [10].

C. Hardware

Two different platforms with distinct configurations are used in this experiment. Their configuration parameters, for one workstation and one node of a supercomputer, are summarized in Table I. The workstation is an 8-core system from Intel, runs at 2.13 GHz, and has 3 GB RAM. The supercomputer has 32 AMD cores per node, runs at 2.3 GHz, and has 64 GB RAM. Both run the Linux operating system: the workstation uses a recent Debian and the supercomputer uses RedHat.

D. Important Parameters

Any comparison between a variety of methodologies or models requires clearly defined parameters on the basis of which they can be compared and contrasted. With respect to parallel programming models, we consider the following two important parameters as the basis for comparison: execution time and total power consumption.

Execution Time: Execution time is an important factor which influences an application developer's choice of programming model [3]. Execution time is defined here as the time required for computing the task, not for completing the whole program; for example, the time to compute matrix C, not the time to complete the entire matrix multiplication application. MPI provides its own timing function, MPI_Wtime(), which we call to obtain the execution time.

Consumed Power: For all computing platforms, including high-performance supercomputers, servers, workstations/PCs, and embedded systems, it is very important that a programming model consumes less power than other programming models. On the supercomputer, we measure the power consumed by the application with the help of the ipmitool software; we instrument our code with it and it records the consumed power. On the workstation, we use the "Watts up" power analyzer meter, which is capable of measuring any 120 VAC appliance, to measure the power consumed by the computation.

V. RESULTS

We discuss experimental results from the MPI/C and Pthread/C implementations in the following subsections.

A. Data Independent Problem

For the data independent program, both the Open MPI and Pthread/C implementations take comparable amounts of time, with negligible communication overhead on a same-node system. The execution times for different combinations of threads and cores are presented in Table II. From the experimental results, it is observed that both computer systems yield better performance using Pthread than MPI (in most cases). Pthread has the advantage over MPI of sharing memory and distributing tasks among lightweight threads instead of processes. The performance of the Pthread implementation on the supercomputer is the best, 30% to 6.7x better than the Open MPI implementation on the supercomputer node.
TABLE II. COMPARISON OF DATA INDEPENDENT EXECUTION TIME

Data Independent Computation (Matrix Multiplication)
Execution Time (sec)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           49.35      50.79       42.07      57.89
2           30.09      32.30       21.28      28.83
4           16.27      13.35       11.05      14.32
7           10.95       9.95        6.58       8.90
8            8.89       8.19        5.36       7.57
9            9.30       9.02        5.36       7.98
16           8.95       7.27        3.18       4.82
20           8.10       8.89        2.62       3.77
24           8.34      10.11        2.30       4.13
30           8.21       7.37        1.94       2.60
31           8.11       7.83        1.92       2.60
32           8.16       8.39        1.83       2.42
33           8.24       7.30        1.81      12.15
34           8.18       9.16        2.34      13.33
40           8.14      14.94        2.09      10.30
45           8.28      12.34        1.96       8.35

As the number of threads increases (from 1 to 8), the execution times of both the Pthread/C and MPI programs decrease on both the workstation and the supercomputer. From 9 to 32 threads, the Pthread/C and MPI execution times change only slightly on the workstation; however, on the supercomputer node both keep decreasing (in most cases) as the number of threads increases. This is because we use 8 cores in the workstation and 32 cores in the supercomputer node.

Execution time for a large number of threads (9 to 33 and beyond) remains almost the same on the workstation because there is no communication overhead. The results (see Figure 4) also show that MPI execution time on the supercomputer increases significantly when the number of threads grows from 32 to 33 and beyond, while the Pthread/C execution time on the supercomputer remains almost unchanged. This is due to the communication overhead among the cores (32 cores in a supercomputer node) when using MPI message passing.

The supercomputer node (4x more cores than the workstation) exhibits relatively good efficiency with 24 or more threads on the Pthread implementation, and with 20-32 threads on the Open MPI implementation. On the Open MPI implementation with fewer than 20 threads, not much work takes place on the supercomputer node; with more than 32 threads, the communication time overhead rises. A linear increase in consumed power is observed with the increase in core usage (1 to 8 cores for the workstation and 1 to 32 cores for the supercomputer). The collected results of Pthread and Open MPI performance in terms of consumed power for both the workstation and the supercomputer are shown in Table III and Figure 5. The power consumption of the supercomputer is always higher than that of the workstation and is a function of the number of used cores (8 vs. 32) and the communication gear (the power consumption gap between the two machines rises after 8 threads and stabilizes after 32 threads).

Fig. 4. Execution Time of data independent programs.

B. Data Dependent Problem

Due to data dependency, each process must wait for some other processes to finish. As mentioned earlier, the estimated execution time is O(n*m*I) for the serial program; a parallel implementation can be estimated as O(n*m*c*I/p), where n and m are the grid dimensions, p is the number of processes, c is the communication latency, and I is the number of iterations. Table IV and Fig. 6 present the data collected for the data dependent computation.

TABLE III. COMPARISON OF DATA INDEPENDENT POWER CONSUMED

Data Independent Computation (Matrix Multiplication)
Power Consumption (watts)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           182        182         318        310
2           187        187         324        324
4           202        201         338        336
7           215        220         366        346
8           222        225         378        354
9           222        226         374        364
16          222        225         430        412
20          222        225         444        448
24          222        225         452        464
30          222        226         472        490
31          227        225         476        494
32          225        228         476        496
33          223        226         478        496
34          222        224         478        496
40          226        225         478        496
45          223        226         476        496
Fig. 5. Power Consumption of data independent programs.
Fig. 6. Execution Time of data dependent program.

The Pthread implementation is the best on the workstation for 1-9 threads, and on the supercomputer node for 16 or more threads. As the number of threads increases (from 1 to 8), the execution time decreases in all cases. For 9 to 32 threads, the Pthread/C and MPI execution times keep decreasing with the number of threads only on the supercomputer. When the number of dependent processes rises above the number of cores, context switching takes place and, as a result, execution time increases drastically. As the number of threads increases from 32 to 33, there is a significant increase in execution time owing to the Opteron's NUMA architecture with point-to-point communication links.

TABLE IV. COMPARISON OF DATA DEPENDENT EXECUTION TIME

Data Dependent Computation (Heat Transfer)
Execution Time (sec)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           69.18      78.40       96.46      103.11
2           34.72      39.23       48.40       51.81
4           17.36      19.95       24.36       26.04
7           10.03      11.37       13.90       15.00
8            9.25      18.07       12.27       13.12
9            9.53      17.64       10.84       11.61
16           8.97      15.05        6.16        6.67
20           8.99      12.25        4.93        5.33
24           9.07      13.61        4.14        4.50
30           8.93      14.36        3.33        3.79
31           8.65      13.46        3.11        3.66
32           8.92      12.77        3.11        3.62
33           8.66      12.88        4.66       12.87
34           8.89      12.87        4.95       88.07
40           8.94      12.42        4.92        -
45           8.39      11.83        4.11        -

The acquired power consumption results for the data dependent computation are shown in Table V and Figure 7. Again, power consumption is a function of the number of used cores. We observe that power consumption falls slightly after 8 threads on the workstation for both the Pthread/C and Open MPI implementations.

TABLE V. COMPARISON OF DATA DEPENDENT POWER CONSUMED

Data Dependent Computation (Heat Transfer)
Power Consumption (watts)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           187        190         316        326
2           195        206         326        332
4           214        224         342        346
7           227        243         376        366
8           246        248         396        374
9           222        249         404        392
16          227        238         446        434
20          226        240         462        460
24          226        240         486        474
30          226        241         508        508
31          225        239         512        512
32          224        240         518        512
33          224        239         520        512
34          223        240         520        512
40          224        237         520        -
45          225        240         520        -
implementations of both selected applications. One exception is the matrix multiplication with 4-8 threads, where the Open MPI implementation on the workstation exhibits the lowest time x power product. When comparing the Pthread implementations of both applications on both machines, the time x power product is better (lower) on the supercomputer for the matrix multiply application (for 9 or more threads), and better on the workstation for the heat transfer application (except for 30-32