All content following this page was uploaded by Abu Asaduzzaman on 05 January 2016.
Abstract—The advancement of multicore systems demands applications with more threads. To facilitate this demand, parallel programming models such as the message passing interface (MPI) have been developed. By using such models, the execution time and the power consumption can be reduced significantly. However, the performance of MPI programming depends on the total number of threads and the number of processing cores in the system. In this work, we experimentally study the impact of Open MPI and POSIX Thread (Pthread) implementations on the performance and power consumption of multicore systems. A data dependent application (heat conduction on a 2D surface) and a data independent application (matrix multiplication) are run on high performance hardware in the experiments. The results suggest that both implementations, with more threads running in a system with more cores, have the potential to reduce the execution time with negligible or little increase in total power consumption. It is also observed that the performance of the MPI implementation varies due to the dynamic communication overhead among the processing cores.

Keywords—data dependency; message passing interface; multicore architecture; Open MPI; Pthread

I. INTRODUCTION

Multicore architecture helps improve the performance/power ratio by dividing a large job into smaller tasks and executing them concurrently on multiple cores at a lower frequency. However, providing more cores does not directly translate into performance [1, 2]. Multithreaded parallel programming has the ability to execute a computation in less time and with less power consumption. Therefore, parallel programming using the message passing interface can be a good choice to increase performance [3].

Most single-core and multicore processors support multithreading. POSIX Threads (usually referred to as Pthreads) is a POSIX standard for threads. For more than a decade, MPI has been the choice for high performance computing, and it has proven its capability in delivering higher performance in parallel applications [4]. The MPI standard and middleware help ensure that MPI programs continue to perform well on parallel and cluster architectures across a wide variety of network fabrics. Almost all MPI implementations bind an MPI process to an operating system (OS) process, where normally a "one process" per "processor core" mapping is used [5]. As a result, the notion of an MPI process is tightly bound to the physical resources of the machine, in particular the number of cores and the number of OS processes that can be created.

Ideally, adding more MPI processes to the simulation would bring about a linear decrease in execution time and power consumption. However, the benefits of adding more processes are reduced by the overhead associated with the additional processes: the communication delay and the dependencies among the processes [5, 6].

In this paper, we experiment with two programming techniques, Pthread and Open MPI, using a data dependent program (heat transfer) and a data independent computation (matrix multiplication), and analyze their impact on execution time and consumed power.

This paper is organized as follows. In Section 2, we review the background. Section 3 briefly describes the problems. Section 4 presents the experimental setup. Section 5 discusses the results in terms of execution time and total power consumption. Finally, Section 6 concludes the paper.

II. BACKGROUND STUDY

Parallel programming is becoming more and more popular because of its potential to improve performance, especially on multicore architectures. In the past, computer software was written using serial computation concepts, which is usually less efficient than multithreaded parallel computation.

A thread (i.e., a thread of execution) is the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler. A thread is a light-weight process. Most modern processors support multiple threads and simultaneous multithreading. On a single-processor system, multithreading generally occurs by time-division multiplexing. On multicore systems, the threads actually run at the same time, with each processor or core running a particular thread [7].

In parallel computing, large problems are often divided into smaller ones, which are then solved concurrently (i.e., in parallel). Parallel computing is a rapidly-growing field, but many problems do not currently scale well onto multiple processors. In many cases, much of the benefit of parallelism is lost because of inefficiencies in the parallel program structure or a failure to hide the latency of message passing [3].

MPI is a message-passing application programmer interface: this language-independent communications protocol is used to program parallel computers for high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today [4, 8, 9]. However,
MPI-based programs decrease in performance as the number of processes running in parallel increases [5]. While this may appear to be of little consequence for small applications, many large applications can be significantly affected. Therefore, it is important to study the behavior of multiple threads versus the number of processing cores.
III. PROBLEMS CONSIDERED
Different applications scale differently on multicore systems. In this work we consider matrix multiplication (a data independent problem) and heat transfer on a 2D surface (a data dependent problem).
A. Data Independent Problem
Matrix multiplication is one of the most common numerical operations for data independent computation. A data independent computation is one in which a program statement does not refer to the data of a preceding statement. Figure 1 shows how a 4x3 matrix C (4 rows, 3 columns) is obtained by multiplying a 4x2 matrix A by a 2x3 matrix B. Here, element c1,2 is computed as (a1,1*b1,2 + a1,2*b2,2); c3,3 as (a3,1*b1,3 + a3,2*b2,3); and so on.
Fig. 2. Major steps in matrix multiplication.
involves dividing the grid into sub-grids. Each sub-grid must in turn exchange information with its adjacent sub-grids, propagating data in a cascading manner. It is therefore difficult to split up the iteration space so that reasonable sections can be calculated in parallel.

By distributing tasks/threads among the cores, the overall execution time is reduced from O(n*m*I) to O(n*m*c*I/p), where n and m are the dimensions of the grid, p is the number of processes, I is the number of iterations, and c is the communication latency. Ideally, more MPI processes help decrease the execution time and also the power consumption. In practice, however, the dependencies among the segments, the overhead associated with the additional processes, and the communication among them reduce these gains.

IV. EXPERIMENTAL SETUP

As overall system performance is influenced by the experimental platform, we briefly discuss the workloads, software, and hardware used in this experiment.

TABLE I. PLATFORM PARAMETERS

                  Workstation                Supercomputer
CPU               2x Quad-Core Intel         4x Octa-Core AMD
                  Xeon E5506                 Opteron 6134
Speed             2.13 GHz                   2.3 GHz per node
RAM               3 GB DDR3                  64 GB DDR3
                  (3x1GB 1333MHz)            (16x4GB 1333MHz)
Mainboard         Dell 0CRH6C                Dell 06JC9T
Chipset           Intel X58 I/O + ICH10R     AMD SR5650
Operating System  Linux (Debian)             Linux (RedHat)
Power             150W (idle), 254W (peak)   230W (idle), 328W (peak)
GPU Card          NVIDIA Tesla C2075         NONE
A. Workloads

For the data independent computation, we use 1024 x 1024 matrices for the matrix multiplication in both the Pthread/C and MPI implementations.

For the data dependent computation, we perform computations related to heat transfer in a room. In our experiment, we consider the following parameters: the total number of iterations is 10,000 and the dimension of the room is 640 x 480 square feet.

B. Software

The most commonly-used language for the different parallel programming models is the high-level language C. It is very portable: compilers exist for virtually every processor. Considering this popularity, we embed the MPI parallel programming model in serial C programs.

MPI uses a message passing technique to send and receive data. Message passing involves the transfer of data from one process (send) to another process (receive) [10].

C. Hardware

Two different platforms with distinct configurations are used in this experiment. Their configuration parameters, for one workstation and one node of a supercomputer, are summarized in Table I. The workstation is an 8-core system from Intel, runs at 2.13 GHz, and has 3 GB RAM. The supercomputer has 32 AMD cores per node, runs at 2.3 GHz, and has 64 GB RAM. Both run the Linux operating system: the workstation uses a recent Debian and the supercomputer uses RedHat.

D. Important Parameters

Any comparison between a variety of methodologies or models requires clearly defined parameters on the basis of which they can be compared and contrasted. With respect to parallel programming models, we consider the following two important parameters as the basis for comparison: execution time and total power consumption.

Execution Time: Execution time is an important factor which influences an application developer's choice of programming model [3]. Execution time is defined here as the time required for computing the task, not for completing the whole program; for example, the time to compute matrix C, not the time to complete the entire matrix multiplication application. MPI provides its own timing function, MPI_Wtime(), which we call to obtain the execution time.

Consumed Power: For all computing platforms, including high-performance supercomputers, servers, workstations/PCs, and embedded systems, it is very important that a programming model consumes less power than other programming models. On the supercomputer, we measure the power consumed by the application with the help of the ipmitool software; we instrument our code with it and it records the consumed power. On the workstation, we use the "Watts up" power analyzer meter, which is capable of measuring any 120 VAC appliance, to measure the power consumed by the computation.

V. RESULTS

We discuss experimental results from the MPI/C and Pthread/C implementations in the following subsections.

A. Data Independent Problem

For the data independent program, both the Open MPI and Pthread/C implementations take comparable amounts of time, with negligible communication overhead on a same-node system. The execution times for different combinations of threads and cores are presented in Table II. From the experimental results, it is observed that both computer systems yield better performance using Pthread than MPI (in most cases). Pthread has the advantage over MPI of sharing memory and distributing tasks among lightweight threads instead of processes. The performance of the Pthread implementation on the supercomputer is the best, 30% to 6.7x better than the Open MPI implementation on the supercomputer node.
TABLE II. COMPARISON OF DATA INDEPENDENT EXECUTION TIME

Data Independent Computation (Matrix Multiplication)
Execution Time (sec)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           49.35      50.79       42.07      57.89
2           30.09      32.30       21.28      28.83
4           16.27      13.35       11.05      14.32
7           10.95       9.95        6.58       8.90
8            8.89       8.19        5.36       7.57
9            9.30       9.02        5.36       7.98
16           8.95       7.27        3.18       4.82
20           8.10       8.89        2.62       3.77
24           8.34      10.11        2.30       4.13
30           8.21       7.37        1.94       2.60
31           8.11       7.83        1.92       2.60
32           8.16       8.39        1.83       2.42
33           8.24       7.30        1.81      12.15
34           8.18       9.16        2.34      13.33
40           8.14      14.94        2.09      10.30
45           8.28      12.34        1.96       8.35

As the number of threads increases (from 1 to 8), the execution times of both the Pthread/C and MPI programs decrease on both the workstation and the supercomputer. From 9 to 32 threads, the Pthread/C and MPI execution times change only slightly on the workstation; however, on the supercomputer node both keep decreasing (in most cases) as the number of threads increases. This is because we use 8 cores in the workstation and 32 cores in the supercomputer node.

Execution time for a large number of threads (9 to 33 and beyond) remains almost the same on the workstation because there is no communication overhead. The results (see Figure 4) also show that MPI execution time on the supercomputer increases significantly when the number of threads grows from 32 to 33 and beyond, while the Pthread/C execution time on the supercomputer remains almost unchanged. This is due to the communication overhead among the cores (32 cores in a supercomputer node) when using MPI message passing.

The supercomputer node (4x more cores than the workstation) exhibits relatively good efficiency with 24 or more threads on the Pthread implementation, and with 20-32 threads on the Open MPI implementation. On the Open MPI implementation with fewer than 20 threads, not much work takes place on the supercomputer node; with more than 32 threads, the communication time overhead rises. A linear increase in consumed power is observed with the increase in core usage (1 to 8 cores for the workstation and 1 to 32 cores for the supercomputer). The collected results of Pthread and Open MPI performance in terms of consumed power for both the workstation and the supercomputer are shown in Table III and Figure 5. The power consumption of the supercomputer is always higher than that of the workstation and is a function of the number of used cores (8 vs. 32) and the communication gear (the power consumption gap between the two machines rises after 8 threads and stabilizes after 32 threads).

Fig. 4. Execution Time of data independent programs.

B. Data Dependent Problem

Due to data dependency, each process must wait for some other processes to finish. As mentioned earlier, the estimated execution time is O(n*m*I) for the serial program; a parallel implementation can be estimated as O(n*m*c*I/p), where n and m are the grid dimensions, p is the number of processes, c is the communication latency, and I is the number of iterations. Table IV and Fig. 6 present the data collected for the data dependent computation.

TABLE III. COMPARISON OF DATA INDEPENDENT POWER CONSUMED

Data Independent Computation (Matrix Multiplication)
Power Consumption (watts)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           182        182         318        310
2           187        187         324        324
4           202        201         338        336
7           215        220         366        346
8           222        225         378        354
9           222        226         374        364
16          222        225         430        412
20          222        225         444        448
24          222        225         452        464
30          222        226         472        490
31          227        225         476        494
32          225        228         476        496
33          223        226         478        496
34          222        224         478        496
40          226        225         478        496
45          223        226         476        496
Fig. 5. Power Consumption of data independent programs.
Fig. 6. Execution Time of data dependent program.

The Pthread implementation is the best on the workstation for 1-9 threads, and on the supercomputer node for 16 or more threads. As the number of threads increases (from 1 to 8), the execution time decreases in all cases. For 9 to 32 threads, the Pthread/C and MPI execution times keep decreasing with the number of threads only on the supercomputer. When the number of dependent processes rises above the number of cores, context switching takes place and, as a result, execution time increases drastically. As the number of threads increases from 32 to 33, there is a significant increase in execution time owing to the Opteron's NUMA architecture with point-to-point communication links.

TABLE IV. COMPARISON OF DATA DEPENDENT EXECUTION TIME

Data Dependent Computation (Heat Transfer)
Execution Time (sec)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           69.18      78.40       96.46      103.11
2           34.72      39.23       48.40       51.81
4           17.36      19.95       24.36       26.04
7           10.03      11.37       13.90       15.00
8            9.25      18.07       12.27       13.12
9            9.53      17.64       10.84       11.61
16           8.97      15.05        6.16        6.67
20           8.99      12.25        4.93        5.33
24           9.07      13.61        4.14        4.50
30           8.93      14.36        3.33        3.79
31           8.65      13.46        3.11        3.66
32           8.92      12.77        3.11        3.62
33           8.66      12.88        4.66       12.87
34           8.89      12.87        4.95       88.07
40           8.94      12.42        4.92        -
45           8.39      11.83        4.11        -

The acquired power consumption results for the data dependent computation are shown in Table V and Figure 7. Again, power consumption is a function of the number of used cores. We observe that power consumption falls slightly after 8 threads on the workstation for both the Pthread/C and Open MPI implementations.

TABLE V. COMPARISON OF DATA DEPENDENT POWER CONSUMED

Data Dependent Computation (Heat Transfer)
Power Consumption (watts)

Number of   Workstation            Supercomputer
Threads     Pthread/C  Open MPI    Pthread/C  Open MPI
1           187        190         316        326
2           195        206         326        332
4           214        224         342        346
7           227        243         376        366
8           246        248         396        374
9           222        249         404        392
16          227        238         446        434
20          226        240         462        460
24          226        240         486        474
30          226        241         508        508
31          225        239         512        512
32          224        240         518        512
33          224        239         520        512
34          223        240         520        512
40          224        237         520        -
45          225        240         520        -
implementations of both selected applications. One exception is the matrix multiplication with 4-8 threads, where the Open MPI implementation on the workstation exhibits the lowest time x power product. When comparing the Pthread implementations of both applications on both machines, the time x power product is better (lower) on the supercomputer for the matrix multiply application (for 9 or more threads), and better on the workstation for the heat transfer application (except for 30-32