Parallel and Distributed Programming on Low Latency Clusters

BY VITTORIO GIOVARA B.Sc. (Politecnico di Torino) 2007

THESIS Submitted as a partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Chicago, 2010

Chicago, Illinois

To my mother, without whose continuous love and support I would never have made it.


ACKNOWLEDGMENTS

I want to thank all my family, my mother Silvana, my grandmother Nenna and my dear Tanino who help me with love and support every day of my life. Then I would like to thank all the faculty members that assisted me with this project, in particular professor Bartolomeo Montrucchio and professor Carlo Ragusa for all the time spent with me trying to make the software run, and researcher Fabio Freschi for giving me useful suggestions during development. Finally I would like to thank all my friends that were near me during these years, Alberto Grand, whose patience and kindness towards me are really extraordinary, and Salvatore Campione, who is an encouraging model for my studies.

V. G.


TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
1.1 Evolution of parallel and distributed systems
1.2 Computer architecture classification
1.3 Thesis Contents

2 BACKGROUND
2.1 Parallel and distributed application developing
2.2 Technological requirements
2.2.1 SMP processors
2.2.2 Multithreading
2.2.3 GPGPU computing
2.2.4 NUMA machines
2.2.5 Clusters
2.3 Scientific software advance

3 TECHNOLOGY
3.1 Parallel applications with OpenMP
3.1.1 Amdahl’s Law
3.1.2 Benchmarking
3.1.2.1 Sequential program with OpenMP enhancements
3.1.2.2 OpenMP schedulers performance
3.1.2.2.1 Static Scheduler
3.1.2.2.2 Dynamic Scheduler
3.1.2.2.3 Guided Scheduler
3.1.2.3 OpenMP enhancement results
3.2 Infiniband
3.3 Distributed execution with MPI
3.3.1 MPI over Infiniband
3.3.2 Benchmarks
3.3.2.1 Single message over Infiniband with MPI
3.3.2.2 Multiple messages over Infiniband with MPI
3.3.2.3 Latency

4 ALGORITHM
4.1 Overview
4.2 Code Flowchart
4.3 Test Case
4.4 Profiling
4.5 Compiler optimizations
4.5.1 Native switch
4.5.2 Loop unrolling
4.5.3 IEEE compliance
4.5.4 Library Striping

5 IMPLEMENTATION
5.1 General Scheme
5.2 Hardware Support
5.3 Applied Directives
5.3.1 MPI Layer
5.3.2 DO directive
5.3.3 REDUCTION directive
5.3.4 Avoiding data dependency
5.4 Results
5.4.1 Reduced test case
5.4.2 Final test case

6 CONCLUSION

APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D

CITED LITERATURE

VITA

LIST OF TABLES

TABLE

I MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGURATIONS
II MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY
III PARTIAL RESULTS
IV FINAL RESULTS
V FUNCTION RESULTS

LIST OF FIGURES

FIGURE

1 Approach levels for parallelization
2 Classification scheme of computer architecture classification
3 Image showing the tree splitting procedure of a sequential task
4 Graph plotting of the theoretical curve from Amdahl’s Law
5 Graph plotting of Amdahl’s Law for multiprocessors
6 Performance overview of an OpenMP threaded program
7 OpenMP static scheduler performance chart
8 OpenMP dynamic scheduler performance chart
9 OpenMP guided scheduler performance chart
10 OpenMP scheduler overview
11 Time v. size for a single message
12 Time v. size for 1024 consecutive messages
13 Flowchart of the main functions implemented in the code
14 Standard problem #4 representation
15 S state field representation
16 Call graph scheme of the target software
17 Implementation scheme overview

LIST OF ABBREVIATIONS

API: Application Programming Interface
SMP: Symmetric Multi-Processing
OpenMP: Open Multi-Processing
MPI: Message Passing Interface
IPC: Inter Process Communication
PML: Point-to-point Messaging Layer
BTL: Byte Transfer Layer
SISD: Single Instruction Single Data
SIMD: Single Instruction Multiple Data
MISD: Multiple Instructions Single Data
MIMD: Multiple Instructions Multiple Data
SPMD: Single Program Multiple Data
MPMD: Multiple Program Multiple Data
SSE: Streaming SIMD Extensions
SSSE3: Supplemental Streaming SIMD Extensions 3
UMA: Uniform Memory Access
NUMA: Non-Uniform Memory Access
GPU: Graphics Processing Unit
GPGPU: General-Purpose computing on Graphics Processing Units
ECC: Error-Correcting Code
LLG: Landau-Lifshitz-Gilbert equation

SUMMARY

The goal of this thesis is to increase the performance and data throughput of Sally3D, an electromagnetic field analyzer and micromagnetic modeler for nanomagnets developed by the Electrical Engineering department of “Politecnico di Torino”. This target has been achieved by means of open standards such as OpenMP and MPI, which offer a robust parallel programming paradigm and an efficient message passing API respectively; in order to reduce the latency of message passing between the two machines, a point-to-point Infiniband link has been set up. Results are provided showing that an 80% speed improvement can be achieved thanks to optimized code, OpenMP multithreading and MPI communication. The hardware consists of two computers, each with two quad-core Intel Xeon processors running at 2.5 GHz, supplied with 32 GB of RAM and a 20 Gb/s Infiniband network card.

CHAPTER 1

INTRODUCTION

1.1 Evolution of parallel and distributed systems

Until a few decades ago computer applications were written in a sequential style, in which instructions were executed in a fixed order; programs relied on a single processing unit and throughput depended on the processor speed. Nowadays, however, the technological trend is to limit processor frequency and voltage in order to consume less power and generate less heat, and on such architectures sequential programming is no longer effective. For this reason a new execution paradigm has been exploited: parallel programming.

Parallel computing is the simultaneous execution of operations at different levels. The most widely used forms of parallelism are bit-level (increasing the bit size of words), instruction-level (exploiting instruction pipelines in processor architectures), loop-level (distributing the data-independent instructions of a loop among different cores) and task-level (distributing complete threads among the cores).

Parallel applications require hardware support. There are many kinds of parallel-oriented computers: multi-core machines (a single processor with many processing units), symmetric multiprocessing machines (more than one, possibly multi-core, processor), clusters and grids (closely coupled computers connected by high-end networks), and finally graphics processing units, which are used for general purpose computation and are suited for linear and array operations.

On the other hand parallel applications bring drawbacks at different levels: manually programming threads and concurrent processes is a difficult task, as data dependencies must be carefully handled, and a poor programming style may lead to performance degradation. Moreover a parallel environment introduces new problems, such as deadlock or starvation, in which execution cannot continue due to resource dependency conflicts.

Consequently there has been an increasing research effort to circumvent the difficulties of parallel programming by having the compiler parallelize the code automatically. However, complete automatic parallelization is a very complex operation, requiring computational power that has not yet been reached; for this reason several other approaches have been proposed.

A quite simple and somewhat effective technique is loop unrolling, activated by the proper compiler optimizations: instead of translating a loop into a sequence of operations followed by a jump, the loop is transformed into a completely sequential program, avoiding many jumps and processor pipeline flushes. This is quite beneficial for pipelined processors, which have a high overhead for jump operations, but the code size grows proportionally to the dimension of the loop and there is still exponential complexity in unrolling very large loops.

A more effective approach was introduced a few years ago, in which programmers insert hints as compiler directives: in this way it is possible to define sections of code that can be safely parallelized, exploiting the full capabilities of multicore processors. The level of interaction in this methodology is higher than with loop unrolling, as it requires deeper knowledge of the program and of the dependencies between variables; however even a limited insertion of compiler directives has a major effect on parallelization and program throughput.

Figure 1 shows the different parallelization methodologies and the depth of intervention each one requires; as may seem obvious, full parallelization is best achieved when it is set as a goal during program design, but it is possible to adapt the project during development at different stages, each requiring an action of different difficulty.

Figure 1. Approach levels for parallelization
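The loop unrolling transformation described in this section can be made visible with a small sketch. Python is used here only for readability: a real compiler applies the transformation to the generated machine code, and the function names `dot_rolled` and `dot_unrolled4` are hypothetical, chosen for this illustration. The sketch shows a partial unrolling by a factor of four, which trades one loop test and jump per element for one per four elements.

```python
def dot_rolled(a, b):
    """Reference loop: one multiply-add and one loop test per element."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same computation with the loop body replicated four times."""
    n = len(a)
    total = 0.0
    i = 0
    # Main unrolled loop: four elements per loop test and jump.
    while i + 4 <= n:
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
        i += 4
    # Remainder loop for lengths that are not a multiple of four.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total
```

Both functions add the products in the same order, so they return the same value; the unrolled version is simply longer, which is the code size cost mentioned above.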


1.2 Computer architecture classification

As soon as parallel computation theory began to gain popularity, there was a shift in computer architecture design and a precise classification was needed. Starting from the model of a single processor operating on a single data stream, it became possible to consider single or multiple instructions operating on single or multiple data; the resulting classes are gathered in Flynn’s taxonomy:

SISD computers are traditional machines with a single processor operating on a single instruction stream and a single data stream, often stored in a single memory. This is the oldest architecture design and was the leading model in computer markets until a decade ago, when the first MMX extension was added to Intel processors.

SIMD is the general modern architecture commonly found in current processors in the form of SSE, Altivec and VIS (Visual Instruction Set, a technology present in SPARC processors) instructions, among others; most recently GPUs have started to exceed this paradigm, with emphasis on vector parallelization. Multimedia operations are the prime beneficiaries of this architecture, as well as cryptography and data compression.

MISD is an uncommon architecture, as there is no performance benefit from this design, but it is often found in mission critical applications, in which a dependable system must be developed. As a matter of fact, operating on single data with multiple identical instructions allows error detection and error correction by means of hardware and time redundancy.

MIMD systems are suited for computer clusters in which a shared or distributed memory is used; processors may function asynchronously and independently. Parallelism is achieved because at any time the computers may be executing different instructions on different data.

Figure 2. Classification scheme of computer architecture classification


The MIMD class can be further subdivided by extending the concept of “instruction” to the notion of “program”:

SPMD: multiple processors execute the same program at the same time, but at independent points in the code and on different data;

MPMD: an implementation of a client/server model in which a master feeds the other nodes with data and coordinates the workload distribution, so that each node executes a different set of programs on different data and reports its result to the master.
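The SPMD idea can be sketched in a few lines. This is a minimal illustration and not the thesis code: the names `spmd_program` and `run_spmd` are hypothetical, the sum of squares is an arbitrary workload, and the "launcher" runs the ranks one after another instead of starting real processes. The point is only that every rank executes the same function and that the rank number alone decides which slice of the data it works on.

```python
def spmd_program(rank, size, data):
    """The single program that every process runs; only `rank` differs."""
    # Block distribution: rank r owns the indices in [lo, hi).
    lo = rank * len(data) // size
    hi = (rank + 1) * len(data) // size
    return sum(x * x for x in data[lo:hi])


def run_spmd(size, data):
    """Stand-in for a launcher: runs the program once per rank, in sequence."""
    partials = [spmd_program(rank, size, data) for rank in range(size)]
    # Combining the partial results plays the role of a final reduction.
    return partials, sum(partials)
```

In an MPMD arrangement the ranks would instead call different functions, with one of them acting as the master that hands out data and collects results.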

1.3 Thesis Contents

This thesis describes how to apply these levels of parallelization to a completely serial numerical program, in order to increase its computational performance in a distributed and parallel environment. For this reason a MIMD system will be exploited. The program is an equation solver written in FORTRAN, suited for electromagnetic field analysis, with a high level of plotter resolution. Since the program is already provided, it is not possible to abstract to a very high level methodology; for this reason the parallelization technology selected is OpenMP, which offers a set of compiler directives to spread sequential sections of code over every core of the machine.

As for the distributed part of the algorithm, two technologies have been adopted: MPI and Infiniband. MPI is a high level API for performing Inter Process Communication on the same machine or on different nodes, available for many different programming languages (even for those which do not have IPC capabilities). Infiniband, on the other hand, was chosen for its outstanding performance in sending small quantities of data with very little latency.

After this introduction, the document presents a general background and previous work regarding parallel application methodologies, followed by a thorough description of the technologies used in this research. Then the main algorithm of the program is outlined, showing the critical points in which a performance increase may be achieved through parallelization or distribution; finally some results are presented, tracing the throughput growth of the program with OpenMP and MPI directives.

CHAPTER 2

BACKGROUND

2.1 Parallel and distributed application developing

Historically, parallel and distributed computing has been considered to be “the high end of computing”, and has been used to model difficult scientific and engineering problems found in the real world. Some examples (source: Livermore Computing Center):

• Atmosphere, Earth, Environment;
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics;
• Bioscience, Biotechnology, Genetics;
• Chemistry, Molecular Sciences;
• Geology, Seismology;
• Mechanical Engineering - from prosthetics to spacecraft;
• Electrical Engineering, Circuit Design, Microelectronics;
• Computer Science, Mathematics;
• processing of large amounts of data in sophisticated ways such as:
  – Databases, data mining;
  – Oil exploration;
  – Web search engines, web based business services;
  – Medical imaging and diagnosis;
  – Pharmaceutical design;
  – Management of national and multi-national corporations;
  – Financial and economic modeling;
  – Advanced graphics and virtual reality, particularly in the entertainment industry;
  – Networked video and multi-media technologies;
  – Collaborative work environments.

2.2 Technological requirements

2.2.1 SMP processors

As the demand for performance increases and the cost of microprocessors continues to drop, the single processor model has been abandoned in favor of an SMP organization. An SMP architecture refers to a computer system composed of multiple processors connected to a single shared memory and to a shared I/O controller. Operating system support is necessary to enable this feature. Moreover programs have to be rewritten, or at least reconsidered, in order to access every available resource. For this reason there has been a continuous improvement of compiler software, trying to simplify program parallelization for developers. Resorting to an SMP architecture can bring many advantages (1):

1. Performance: the workload can be spread among several processors, running different tasks in parallel; moreover interrupt management can affect only one processor at a time, avoiding process suspension and pipeline stalls;
2. Incremental Growth: adding processors increases performance even more, up to a certain extent;
3. Scaling: vendors can offer more systems with different SMP configurations;
4. Transparency: the operating system hides SMP management from the user, as it handles thread scheduling and process synchronization;
5. Availability: it is possible to set up the system to execute the same instruction on all the symmetric processors, so that it can withstand hardware failures (a sort of MISD architecture).

2.2.2 Multithreading

Multithreading is a technique to exploit thread-level parallelism; the unit of execution becomes a single thread of the program in memory. Once again, this feature must be enabled in software, through operating system support (2). It is possible to increase execution parallelism by using one of the following implementations:

interleaved multithreading (fine-grained): at every clock cycle the processor switches execution from one thread to another, skipping any thread that is not ready (blocked by a data dependency or by memory latency);

blocked multithreading (coarse-grained): the instructions of a thread are executed continuously until an event that causes a delay, such as a cache miss, occurs; in that case execution is switched to another thread;

simultaneous multithreading (SMT or Hyperthreading): instructions from multiple threads are executed simultaneously, exploiting the intrinsic parallelism of the execution units of the processor;

chip multithreading: the processor is replicated on the physical chip, and each copy handles a separate set of threads; in this way pipeline execution is much simplified.

The Simultaneous Multithreading technique has been implemented in most modern processors, as it has shown the best performance benefits in a variety of applications during testing.

2.2.3 GPGPU computing

General-purpose computing on graphics processing units refers to a technique that allows general purpose execution on the processors present in modern video cards (namely, GPUs). This methodology allows the computing power of the GPU, usually reserved for computer graphics, to be exploited for almost any kind of operation; since the graphics processing unit is composed of many array processors, using a GPGPU programming language enables automatic streaming execution.

Applications that especially benefit from streaming execution are multimedia-related, such as digital signal processing (for audio/video or image manipulation), but there are also many implementations of computer clusters, physics simulators, mathematical solvers and raytracing done with GPGPU. Moreover there is older array-based software that benefits from this rather new technology, such as cryptography, DNA folding, neural networks and medical imaging.

2.2.4 NUMA machines

While general purpose processors adopt uniform memory access (UMA), it is not uncommon to find systems whose memory access time is not uniform and depends on the position of the memory relative to the processor (NUMA, non-uniform memory access). NUMA machines are usually physically distributed but logically shared, meaning that one node can directly access the memory of another node and that not all processors have equal access time to all memories; a software layer is often needed to guarantee program access and workload distribution. Memory is mapped as a global address space, merging the memory of the linked SMP nodes; this feature provides a user-friendly programming perspective on memory, as data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs. However there is a lack of scalability between memory and CPUs, because adding more CPUs can geometrically increase traffic on the shared memory-CPU path. Moreover synchronization constructs need to be implemented to ensure “correct” access to global memory. One final disadvantage is that it is becoming increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.


2.2.5 Clusters

A cluster is an alternative or an addition to symmetric multiprocessing for achieving high performance; a “cluster” can be defined as a group of computers interconnected through some network interface, working together as a unified computing resource. It is possible to create large clusters that by far outperform any standalone machine, with the advantage that it is relatively easy to add new components, even in small increments. Both clusters and SMP systems provide a configuration for high performance applications, and both have advantages and disadvantages. For example an SMP system is easier to manage and has fewer problems in running single-processor software, while clusters require an in-depth program revision, with load balancing and work distribution; on the other hand, clusters dominate in final performance and offer more solutions for availability. Clusters are historically divided into:

High-availability clusters: for improving the availability offered by the cluster itself; they usually exploit redundancy, so that when one node fails it can be immediately substituted by a spare one (active or passive standby);

Load-balancing clusters: with the primary purpose of distributing the workload of a given task or service evenly among the rest of the cluster;

Compute clusters: used for computational activity rather than services; nodes are tightly coupled and computation usually involves a considerable amount of communication; programs can usually be ported to this environment through simple library routines (e.g. MPI);

Grid computing: similar to compute clusters, but focused more on the final computational throughput than on workload distribution and tightly coupled jobs; computation consists of many independent jobs which do not have to share data during the computation process.

2.3 Scientific software advance

Using parallelization technologies such as OpenMP and MPI is not new in scientific software. As a matter of fact quite a number of projects exploit those technologies. For example it is possible to cite the Folding@Home project, from the chemistry department of Stanford University, currently the most powerful distributed computing cluster, which is being developed using an MPI layer between its nodes; many entries of the TOP500 list (a project ranking and detailing the 500 most powerful known computer systems in the world), like Pleiades and Ranger, use Infiniband as the connection link within the cluster.

As for electromagnetic field analyzers, there has been some previous work with OpenMP: (3) and (4) describe a possible implementation for hybrid solvers, but the software they address has different solving and modeling routines. The proposed work does not rely on the standard FEM approach, but adopts a Finite Formulation of a nonlinear magnetostatic algorithm which can be safely parallelized and distributed; see (5) for more information.

CHAPTER 3

TECHNOLOGY

3.1 Parallel applications with OpenMP

OpenMP is an application programming interface (API) that offers a set of compiler directives, library routines and environment variables to enable shared memory multiprocessing for C, C++ and FORTRAN programs. OpenMP stands for Open Multi-Processing and it is implemented in many open source and commercial compilers, such as the Intel C++ and FORTRAN Compilers (icc and ifort) and the GNU Compiler Collection (gcc). Among the key factors of its popularity are the ease of handling threads and shared variables and the simplicity of porting programs to a multithreaded scheme with very little code change; moreover OpenMP enables parallel execution control for languages that cannot usually handle multithreading and synchronization primitives, such as FORTRAN.

With this technology the main program forks a set number of parallel threads which carry out a task, dividing the workload among different cores; by default every thread executes its section of code independently. After the execution of the parallel job, the threads are joined back into the main (or master) thread, which resumes normal sequential execution; in this way the sequence of program execution can be divided in a tree-like structure (as shown in Figure 3).


Figure 3. Image showing the tree splitting procedure of a sequential task

OpenMP exploits preprocessor directives for thread creation and synchronization, workload distribution and sharing, and data and function management, while retaining compatibility with compilers that do not support it. In order to prevent data corruption due to overlapping threads, all variables of the parallel section must have a declared visibility scope, either shared or private. One directive is particularly suited for loop parallelization, as it offers fine-grained control over the scheduling of the threads and over the distribution of the loop iterations among the thread pool. Other directives directly manage thread interaction and synchronization objects (critical regions and variable locking).
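The fork/join structure and the distinction between shared and private variables can be pictured with the following sketch. It is an analogy written with Python's standard `threading` module, not OpenMP code, and the function name `parallel_sum` is hypothetical. Note also that in the standard CPython interpreter threads do not execute Python bytecode on several cores at once, so the sketch illustrates the structure of a parallel region and not its speedup.

```python
import threading


def parallel_sum(values, num_threads=4):
    """Fork a team of threads, let each sum part of `values`, then join."""
    partials = [0.0] * num_threads      # shared: visible to every thread

    def worker(tid):
        local = 0.0                     # private: one copy per thread
        # Iterations are dealt out cyclically: tid, tid + T, tid + 2T, ...
        for i in range(tid, len(values), num_threads):
            local += values[i]
        partials[tid] = local           # each thread writes only its own slot

    team = [threading.Thread(target=worker, args=(t,))
            for t in range(num_threads)]
    for t in team:                      # fork: start of the parallel region
        t.start()
    for t in team:                      # join: end of the parallel region
        t.join()
    return sum(partials)                # the master thread continues alone
```

Giving each thread its own slot in the shared list avoids the data corruption mentioned above without needing a lock; accumulating directly into a single shared variable would require a critical region.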


However, it is important to clarify that using OpenMP on an N processor machine does not reduce the execution time by a factor of N. There are several reasons for this:

• Symmetric Multi Processor computers have increased computational power, but the memory bandwidth does not scale proportionally to the number of processors (or cores); performance degradation occurs especially when the shared memory bandwidth is saturated and data transfer is slowed down;
• synchronization overhead, critical region management, context switch costs and load balancing among the threads may reduce the final speedup;
• not every portion of the code can actually be parallelized;
• the theoretical limit imposed by Amdahl’s Law, which regulates the maximum theoretical speedup of parallel applications, holds.

3.1.1 Amdahl’s Law

Amdahl’s Law is a method used for finding the maximum speed improvement in parallel computing environments. The speedup highly depends on the size of the parallelizable code (6). The formula states that the potential speedup of the program directly depends on the fraction of code P that can be parallelized

$$\mathrm{speedup} = \frac{1}{1 - P} \qquad (3.1)$$


In practice, if none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup); if all of the code can be parallelized, P = 1 and the speedup is, in theory, infinite; if 50% of the code can be parallelized, the maximum speedup is 2, meaning the code will run twice as fast, and so on. Figure 4 shows the theoretical speedup curve with an unlimited number of processors.

Figure 4. Graph plotting of the theoretical curve from Amdahl’s Law

When the number of processors is finite and part of the code cannot be parallelized, the relationship becomes

$$\mathrm{speedup} = \frac{1}{\frac{P}{N} + S} \qquad (3.2)$$


where N is the number of processors, P the fraction of parallelizable code and S the fraction of serial code (equal to 1 − P). Figure 5 shows a set of examples with different fractions of parallelizable code over a variable number of processors. It shows not only that a 95% parallelizable program has a maximum speed improvement of about 20x, however many processors are available, but also that a highly sequential program gains hardly any acceleration.

Figure 5. Graph plotting of Amdahl’s Law for multiprocessors
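Equations (3.1) and (3.2) are easy to evaluate numerically. The sketch below is only an illustration of the two formulas; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical and do not appear in the thesis code.

```python
def amdahl_speedup(p, n):
    """Speedup from equation (3.2), with S = 1 - P and N processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    s = 1.0 - p
    return 1.0 / (p / n + s)


def amdahl_limit(p):
    """Speedup from equation (3.1): the bound for unlimited processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return float("inf")
    return 1.0 / (1.0 - p)
```

For instance, with P = 0.95 equation (3.1) gives a bound of 20, while equation (3.2) with N = 8 (the number of cores of each machine used in this work) gives 1 / (0.95/8 + 0.05), roughly 5.9.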


3.1.2 Benchmarking

In order to understand the possible benefit of using OpenMP, some tests have been run to find the best configuration in terms of the number of threads and the amount of work given to each thread. A simple test program was used, with a long and complex loop containing processor intensive operations (mainly mathematical operations like power and square root). The particular case of an “interesting” loop has been chosen because it shows with enough simplicity the effort/benefit ratio of OpenMP. The two main configuration variables that characterize the benchmarks are the scheduler type and the chunk size, plus the total number of threads involved in the program. The chunk size is a positive integer representing the number of iterations handed to a thread at a time, while the scheduler type may be:

STATIC: loop iterations are divided into chunks of chunk iterations, which are assigned to the threads in a fixed order before the loop is executed;

DYNAMIC: loop iterations are divided into chunks of chunk iterations, which are assigned dynamically, each thread receiving a new chunk when it has completed the previous one;

GUIDED: chunks are assigned dynamically as in the previous case, but their size is proportional to the number of iterations still unassigned, so it starts large and shrinks towards chunk as the loop proceeds.
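The three policies above can be modelled in plain Python to see which iterations each thread receives. This is a simplified model and not the OpenMP runtime: the function names are hypothetical, the dynamic model assumes that the cost of every iteration is known in advance, and the guided model assumes that each chunk is the number of remaining iterations divided by the number of threads, rounded up, which is a common choice but is left to the implementation by the OpenMP specification.

```python
import heapq


def static_schedule(n_iter, n_threads, chunk):
    """STATIC: chunks are dealt out round-robin before the loop starts."""
    plan = [[] for _ in range(n_threads)]
    for k, start in enumerate(range(0, n_iter, chunk)):
        plan[k % n_threads].append((start, min(start + chunk, n_iter)))
    return plan


def dynamic_schedule(costs, n_threads, chunk):
    """DYNAMIC: each chunk goes to whichever thread becomes idle first."""
    plan = [[] for _ in range(n_threads)]
    idle_at = [(0.0, t) for t in range(n_threads)]  # (time free, thread id)
    heapq.heapify(idle_at)
    for start in range(0, len(costs), chunk):
        end = min(start + chunk, len(costs))
        free_time, t = heapq.heappop(idle_at)
        plan[t].append((start, end))
        heapq.heappush(idle_at, (free_time + sum(costs[start:end]), t))
    return plan


def guided_chunks(n_iter, n_threads, chunk):
    """GUIDED: chunk sizes shrink with the number of unassigned iterations."""
    sizes = []
    remaining = n_iter
    while remaining > 0:
        # Assumption: ceil(remaining / n_threads), never below `chunk`.
        size = max(chunk, -(-remaining // n_threads))
        size = min(size, remaining)  # the last chunk may be smaller
        sizes.append(size)
        remaining -= size
    return sizes
```

With 10 iterations, 2 threads and a chunk size of 3, the static model gives thread 0 the ranges (0, 3) and (6, 9) and thread 1 the ranges (3, 6) and (9, 10). In the dynamic model a thread stuck on an expensive iteration simply receives fewer chunks, which is the load balancing effect the scheduler is designed for.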

Other scheduler types are auto and runtime: with auto the choice is delegated to the compiler and the runtime system, while with runtime it is taken from the set-up environment (the OMP_SCHEDULE variable). As can be expected, guided-scheduled threads work best with very small chunk sizes (with respect to the total number of iterations), as the scheduling algorithm is more efficient when it can control the pool of threads as a whole, while static and dynamic scheduling prefer a medium chunk size. Beware that setting a fixed number of threads may reduce the total performance of the application; for this very reason the number of threads in the main program has been left at its default value. The test program partially emulates some computationally intensive routines of the target software; the main loop is composed of several mathematical functions that are known to stress the processor and require a long CPU time to be carried out. The code is reported in Appendix B.

3.1.2.1 Sequential program with OpenMP enhancements

In this first test the program is run with an increasing number of available threads, going beyond the eight physical cores actually present. All three scheduling algorithms are evaluated. The value in the first column (one thread) may safely be considered the reference for the program without OpenMP optimizations. There is a large impact when a second thread is added (50% time reduction), after which the execution time asymptotically tends to a given value, in agreement with Amdahl’s Law. It is interesting to notice that the three schedulers perform in the same range of values and that the best performance is achieved in the region of 8-9 threads (given the eight-core machines used). Beyond this value all the schedulers, static and dynamic in particular, suffer from excessive context switches and interference from the operating system preemption mechanism.


Figure 6. Performance overview of an OpenMP threaded program

3.1.2.2 OpenMP schedulers performance

Having evaluated the performance for different numbers of threads, the three types of available schedulers are now compared; moreover, for each scheduler several orders of magnitude of chunk size are tested.

3.1.2.2.1 Static Scheduler

The static scheduler works as expected (Figure 7), showing a very good performance increase in the region of 7-8 threads with chunk values of 10-100. It is interesting to notice that for very high chunk sizes OpenMP cannot reduce the execution time, and this holds for every type of scheduler. The reason for this behavior lies in how OpenMP manages iterations: when the chunk is as large as the loop itself, all iterations are assigned to a single thread and therefore there is no benefit.
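The effect of an oversized chunk can be reproduced with a small model of static scheduling. This is a sketch in Python under the assumption that chunks are handed out round-robin in thread order; the function name static_schedule and the loop length of 1000 iterations are hypothetical and do not come from the benchmark program.

```python
def static_schedule(n_iterations, chunk, n_threads):
    """Return, per thread, the iteration indices it would execute when
    fixed-size chunks are assigned round-robin (model of a static schedule)."""
    work = [[] for _ in range(n_threads)]
    for chunk_id, start in enumerate(range(0, n_iterations, chunk)):
        owner = chunk_id % n_threads  # chunks are dealt out in thread order
        work[owner].extend(range(start, min(start + chunk, n_iterations)))
    return work


# Number of iterations per thread for a medium and for an oversized chunk
print([len(w) for w in static_schedule(1000, 100, 8)])
print([len(w) for w in static_schedule(1000, 1000, 8)])
```

With a chunk as large as the loop, the second call leaves every thread but the first without work, which matches the behavior described above.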

Figure 7. OpenMP static scheduler performance chart

3.1.2.2.2 Dynamic Scheduler

Because of its dynamic behavior, the dynamic scheduler shows very peculiar results with different configurations. For example, as shown in Figure 8, large chunks combined with a small number of threads even introduce additional overhead, while small chunks stay close to the average value regardless of the number of threads. Even with this disparity, however, the best execution time reduction is achieved in the region of 7-9 threads by chunks of medium order.

Figure 8. OpenMP dynamic scheduler performance chart

3.1.2.2.3 Guided Scheduler

The final scheduler presented here is the most straightforward and the best performing, thanks to the more advanced algorithm of guided scheduling. As a matter of fact, for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations); see (7). As anticipated, this scheduler works best with very small chunks, as it can apply its algorithm without interference, and always in the 8-9 threads region.
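The rule quoted above can be turned into a short model that lists the chunk sizes a guided schedule would produce. This is a sketch in Python that assumes a proportionality constant of 1 (the actual constant is left to the OpenMP implementation); guided_chunks and its parameters are hypothetical names.

```python
def guided_chunks(n_iterations, n_threads, min_chunk=1):
    """Chunk sizes of a modelled guided schedule.

    Each chunk is the number of unassigned iterations divided by the number
    of threads (rounded up), never smaller than min_chunk except for the
    last chunk. The proportionality constant of 1 is an assumption.
    """
    sizes = []
    remaining = n_iterations
    while remaining > 0:
        size = max(-(-remaining // n_threads), min_chunk)  # ceiling division
        size = min(size, remaining)  # the last chunk may be shorter
        sizes.append(size)
        remaining -= size
    return sizes


print(guided_chunks(1000, 8))      # large chunks first, then down to 1
print(guided_chunks(1000, 8, 50))  # never below 50, except the last chunk
```

Large chunks at the beginning keep the scheduling overhead low, while the small chunks at the end balance the load among the threads; a large minimum chunk removes the second effect, which is consistent with guided scheduling preferring very small chunk values.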

Figure 9. OpenMP guided scheduler performance chart

3.1.2.3 OpenMP enhancement results

This last section summarizes the global results from the point of view of the scheduler. As reference value, the maximum execution time reduction has been selected for each chunk size of each scheduling algorithm; all these results come from the 7-9 threads region. The test run shows that the best performing scheduler is the guided scheduler with a chunk size on the order of units, and for this reason it has been chosen as the default scheduler in all the OpenMP directives inserted.

Figure 10. OpenMP scheduler overview

3.2 Infiniband

Infiniband is the union of two competing transport designs, Next Generation I/O from Intel, Microsoft and Sun, and Future I/O from Compaq, IBM and Hewlett-Packard. It has become the de facto standard for high speed cluster interconnection, outperforming Ethernet in both transfer rate and latency. This technology implements a modern interconnection link using point-to-point bidirectional serial transfer and supports several signaling rates. It is used in high-performance computing both for high-speed connections between processors and peripherals and for low-latency networking. The standard transmission rate is 2.5 Gbit/s, while double and quad data rates currently achieve 5 Gbit/s and 10 Gbit/s respectively. Moreover, links can be joined in units of 4 or 12, enabling even higher transfer speeds (up to 120 Gbit/s). However, transmitted data is protected against faults by means of information redundancy: every 10 bits sent carry only 8 bits of useful information, which reduces the useful data transmission rate. Table I summarizes the effective data rates of the various configurations. Most notably, there is no standard programming interface for the device: only a set of functions (referred to as verbs) must be present, leaving the implementation to the vendors. The most commonly accepted implementation is provided by the OpenFabrics Alliance. Since Infiniband is a transport layer, many protocols can run on it, from TCP/IP to OpenIB (described in section 3.3.1).


TABLE I

MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGURATIONS

                    Useful data                          Raw data
                    1X         4X          12X           1X           4X          12X
Single Data Rate    2 Gbit/s   8 Gbit/s    24 Gbit/s     2.5 Gbit/s   10 Gbit/s   30 Gbit/s
Double Data Rate    4 Gbit/s   16 Gbit/s   48 Gbit/s     5 Gbit/s     20 Gbit/s   60 Gbit/s
Quad Data Rate      8 Gbit/s   32 Gbit/s   96 Gbit/s     10 Gbit/s    40 Gbit/s   120 Gbit/s
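The entries of Table I follow from the 2.5 Gbit/s lane rate, the data rate multiplier, the link width and the 8-in-10 encoding. The following Python sketch recomputes them; the names LANE_RATE_GBIT, raw_rate and useful_rate are chosen here for illustration only.

```python
LANE_RATE_GBIT = 2.5  # raw signaling rate of a single lane at single data rate
DATA_RATES = {"Single": 1, "Double": 2, "Quad": 4}
LINK_WIDTHS = (1, 4, 12)


def raw_rate(multiplier, width):
    """Raw throughput in Gbit/s for a data rate multiplier and a link width."""
    return LANE_RATE_GBIT * multiplier * width


def useful_rate(multiplier, width):
    """Useful throughput: only 8 of every 10 transmitted bits carry data."""
    return raw_rate(multiplier, width) * 8 / 10


for name, multiplier in DATA_RATES.items():
    row = [(useful_rate(multiplier, w), raw_rate(multiplier, w)) for w in LINK_WIDTHS]
    print(name, row)
```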

3.3 Distributed execution with MPI

MPI is a high level, language-independent API used both for parallel computing and for one-to-one, one-to-many and many-to-many inter-process communication (IPC). It has become the de facto standard for process communication despite the lack of sponsorship by any association. Originally it was developed by William Gropp and Ewing Lusk, among others. This set of APIs is used in high-performance computing for its scalability, portability and performance, as it implements message passing between processes on distributed-memory systems with very few primitives. It usually resides on level 5 of the OSI model but, as there is no strict constraint on this point, there are many implementations that offer different transport, network and data link layers. MPI is available for many programming languages including C, C++, FORTRAN and Java; sometimes implementations benefit from the host language, for example by using object-oriented programming in C++ and Java, and from the hardware they run on. Among the most widespread libraries are OpenMPI, MPICH2 and MVAPICH2, which differ mainly in threading support, network availability (e.g. Ethernet or Infiniband) and hardware optimizations.

3.3.1 MPI over Infiniband

One of the most widely used environments for MPI is Infiniband; thanks to the low latency of Infiniband, a small packet sent through a connection link does not suffer the overhead it would incur on Ethernet, for example. In order to set up a distributed system of this kind, additional software is needed for managing the Infiniband subnet (OpenSM) and for handling the transport layer (OpenIB). The modularity of MPI and Infiniband allows different configurations, and it is common practice to transmit packets over either Infiniband or a TCP/IP stack. This is possible because the transport layer of MPI is handled by two components (among others): the Point-to-Point Messaging Layer (PML) and the Byte Transfer Layer (BTL). The PML abstracts the communication mechanism with buffers, synchronization points and acknowledgment messages; the BTL, on the other hand, translates the byte messages into the byte sequence of the network layer (OpenIB is a BTL module for sending messages on Infiniband). Subsequently the functions (or verbs) available in the Infiniband drivers are invoked and control moves from user space to kernel space, where the message is finally sent across the network link. This seemingly complex structure makes it possible to reduce code complexity and to increase interoperability and maintainability across different implementations.

3.3.2 Benchmarks

As was done with OpenMP, some tests were also performed on the MPI installation and on the Infiniband set-up to check that the machine configuration was correct and that the devices were running at full speed. The test program makes heavy use of the MPI_Send and MPI_Recv routines and uses a timing function with millisecond resolution. A warm-up phase (exchanging some messages between the nodes) proved necessary before any measurement, because the whole structure of MPI plus Infiniband must be activated first.

3.3.2.1 Single message over Infiniband with MPI

In this test the transfer time of messages over Infiniband with MPI directives is evaluated; message size increases quadratically and time is measured with millisecond precision. Data is displayed on a semi-logarithmic scale so that the whole slope can be shown. Two different MPI implementations are compared, and it is possible to notice that OpenMPI outperforms MVAPICH for small and large quantities of data, but is slower for medium-sized messages. With MVAPICH it is not possible to send more than 2 GB of data, due to implementation limits; OpenMPI does not suffer from this limit, but on the other hand it has a start-up latency of about 3.5 seconds before programs start executing (which is not recorded in this test). Other MPI implementations exist, most notably MPICH and LAM/MPI, from which both MVAPICH and OpenMPI derive, but they lack support for Infiniband; any packet transmitted would fall back to plain TCP/IP.
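The measurement procedure used in these benchmarks (untimed warm-up exchanges followed by timed transfers of increasing size) can be outlined as follows. This is only a sketch in Python: the exchange argument is a hypothetical callable standing in for the MPI_Send/MPI_Recv pair of the real test program, and time_exchanges is a name chosen for this example.

```python
import time


def time_exchanges(exchange, sizes, warmup=10, repeats=1):
    """Return (size, elapsed milliseconds) pairs for each message size.

    exchange(size) is a placeholder for one send/receive of size bytes.
    """
    for _ in range(warmup):
        exchange(1)  # untimed warm-up messages activate the whole stack
    results = []
    for size in sizes:
        start = time.perf_counter()
        for _ in range(repeats):
            exchange(size)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append((size, elapsed_ms))
    return results


# Dummy exchange that only allocates a buffer of the requested size
print(time_exchanges(lambda n: bytes(n), [1, 1024, 1024 * 1024]))
```

Setting repeats to a larger value models a test in which several consecutive messages are exchanged for each size.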


Figure 11. Time v. size for a single message

3.3.2.2 Multiple messages over Infiniband with MPI

Using the same structure as above, time v. size is tested here with multiple messages (1024 messages exchanged for each tested size). The results are similar to the previous case.

Figure 12. Time v. size for 1024 consecutive messages

3.3.2.3 Latency

One final test has been run to determine the expected latency in message passing; this has been achieved by sending a 0-length packet using some of the data types available in MPI. However, due to the modularity of the MPI over Infiniband structure, the MPI initialization overhead must be removed: for this reason the same test is repeated both on a single machine and across the two machines, and the two results are subtracted. The latency measured with this method is 8 µs, which is compatible with the Infiniband board specifications. The complete table of results follows.

TABLE II

MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY

Test type      µ-seconds
Single node    26
Two nodes      34
Latency        8

CHAPTER 4

ALGORITHM

4.1 Overview

The target application is a suite of programs called Sally3d; it has been ported from a VMS system to standard FORTRAN, with a standard makefile instead of terminal scripts, and it can be compiled on any UNIX-based operating system. The software is designed for electromagnetic field analysis and micromagnetic modeling of nanomagnets; for this purpose, magnetization dynamics in nanomagnets is described by the Landau-Lifshitz-Gilbert (LLG) equation, which rules the gyromagnetic precession of the magnetization vector field around the so-called micromagnetic effective field. The effective field takes phenomenologically into account the interactions occurring in magnetic materials, such as short-range (exchange, anisotropy) and long-range (magnetostatics, Zeeman) interactions. Magnetization dynamics in a ferromagnetic body is described by the following LLG equation:

\[ \frac{\partial m}{\partial t} = -m \times \left( h_{\mathrm{eff}}[m] - \alpha \frac{\partial m}{\partial t} \right), \tag{4.1} \]

where m = m(r, t) is the magnetization vector field normalized to the saturation magnetization $M_s$, time is measured in units of $(\gamma M_s)^{-1}$ (γ is the absolute value of the gyromagnetic ratio), α is the dimensionless damping parameter, and $h_{\mathrm{eff}}[m(r, t)]$ is the effective field operator, which can be obtained by the variational derivative of the free energy functional:

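\[ h_{\mathrm{eff}}[m] = -\frac{\delta g_L[m]}{\delta m}, \tag{4.2} \]

where

\[ g_L[m] = \frac{1}{V_\Omega} \int_\Omega \left( \frac{l_{\mathrm{ex}}^2}{2} |\nabla m|^2 - \frac{1}{2} h_m \cdot m + \varphi(m) - h_a \cdot m \right) dV, \tag{4.3} \]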

$\varphi(m)$ is the anisotropy energy density and $l_{\mathrm{ex}} = \sqrt{(2A)/(\mu_0 M_s^2)}$ is the exchange length (A is the exchange constant and $\mu_0$ the vacuum permeability), $h_m$ and $h_a$ are the demagnetizing and applied fields, respectively, and $V_\Omega$ is the body volume. In addition, the homogeneous Neumann boundary condition ∂m/∂n = 0 is imposed at the body surface. In order to obtain a spatially discretized version of Equation 4.1, a partition of the region Ω into N cells $\Omega_k$ with volume $V_k$ is considered, and it is assumed that the cells are small enough that the vector fields m(r, t) and $h_{\mathrm{eff}}[m(r, t)]$ can be considered spatially uniform within each cell. The symbols $m_k(t)$ and $h_{\mathrm{eff},k}$ denote the vectors associated with the generic k-th cell. Besides the cell vectors, the mesh vector $m = (m_1, \ldots, m_N)^T \in \mathbb{R}^{3N}$, containing the whole collection of cell vectors, is defined. It is now possible to write the discretized LLG equation in the following form, which consists of a system of ordinary differential equations:

\[ \frac{d m_k}{dt} = -m_k \times h_{\mathrm{eff},k}[m] + \alpha\, m_k \times \frac{d m_k}{dt}, \tag{4.4} \]

where $m_k$ is the average magnetization of the k-th cell. It is worth noting that the effective field in the k-th cell depends on the magnetization of the whole cell collection due to the magnetostatic interaction, namely $h_{\mathrm{eff},k} = h_{\mathrm{eff},k}[m]$. The numerical solution of Equation 4.4 provides the time evolution of the magnetization.

4.2 Code Flowchart

The kernel of the micromagnetic solver integrates over time the LLG equation discretized with respect to space. At every time step, the next value of the magnetic vector is computed by collecting the different finite elements of the magnetic field; this operation is performed by the GILBERT routine and is reported in Figure 13. The equation is a nonlinear differential equation, whose solution is obtained through the Newton-Raphson method of approximation; this is performed by the GINT function. The section of code which has been parallelized and distributed (outlined in yellow in Figure 13) implements the magnetostatic and anisotropic field solvers; the part that combines the different field elements has also been updated with OpenMP and MPI directives. This development scheme has been chosen on the grounds that the real computational bottleneck lay mainly in the magnetostatic solver and partially in the anisotropic solver.
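To give an idea of what the time integration of Equation 4.4 involves, the following Python sketch integrates a single cell with a simplified scheme. It is not the method used by GINT: it replaces the Newton-Raphson solution with an explicit Euler step followed by renormalization, and it assumes a constant effective field h, whereas in the solver the field depends on the whole mesh. For |m| = 1 the implicit form of Equation 4.4 can be rearranged into the explicit form dm/dt = −(m × h + α m × (m × h)) / (1 + α²), which is what llg_rhs evaluates. All function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def llg_rhs(m, h, alpha):
    """Explicit right-hand side equivalent to Equation 4.4 when |m| = 1."""
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    c = -1.0 / (1.0 + alpha * alpha)
    return tuple(c * (mxh[i] + alpha * mxmxh[i]) for i in range(3))


def integrate(m, h, alpha, dt, steps):
    """Explicit Euler integration with renormalization (illustrative only)."""
    for _ in range(steps):
        dm = llg_rhs(m, h, alpha)
        m = tuple(m[i] + dt * dm[i] for i in range(3))
        norm = math.sqrt(sum(x * x for x in m))
        m = tuple(x / norm for x in m)  # keep the magnetization normalized
    return m


# A cell initially along x relaxes towards a constant field along z
print(integrate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), alpha=0.5, dt=0.01, steps=2000))
```

The magnetization precesses around the field while the damping term progressively aligns it with h; the cost of evaluating the effective field at every step, ignored here, is what makes the magnetostatic solver the bottleneck.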


Figure 13. Flowchart of the main functions implemented in the code

4.3 Test Case

In order to carefully analyze the performance of the program and to identify the possible parallelization points, as well as to obtain useful data, a particular test was prepared. The test case is the fourth standard problem of micromagnetics, proposed by Bob McMichael, Roger Koch and Thomas Schrefl. Quoting (8), the problem focuses on dynamic aspects of micromagnetic computations. The initial state is an equilibrium s-state (Figure 15) which is obtained after applying and slowly reducing a saturating field along the [1,1,1] direction to zero. Fields of magnitude sufficient to reverse the magnetization of the rectangle are applied to this initial state and the time evolution of the magnetization is examined as the system moves towards equilibrium in the new fields. The problem will be run for two different applied fields. At t = 0 one field will be applied to the equilibrium s-state: the field is composed of µ0Hx = −24.6 mT, µ0Hy = 4.3 mT, µ0Hz = 0.0 mT (corresponding to approximately 25 mT, directed 170 degrees counterclockwise from the positive x axis).

Figure 14. Standard problem #4 representation


The problem was chosen so that resolving the dynamics should be easier for the 170 degree applied field than for the 190 degree applied field. Preliminary simulations reveal that, in the case of the field applied at 170 degrees, the magnetization in the center of the rectangle rotates in the same direction as at the ends during reversal. In the 190 degree case, however, the center rotates the opposite direction as the ends resulting in a more complicated reversal. The field amplitudes were chosen to be about 1.5 times the coercivity in each case.

Figure 15. S state field representation

4.4 Profiling

Thanks to the standardization of the program code, it was possible to exploit the gprof utility, available in the gcc suite. This utility provides procedure-level timing information with reasonable resolution, as well as a complete call graph for identifying the most computationally expensive functions. According to the profiler, whose call graph is reported in Figure 16, the following functions were the most time consuming:

• calc_intmudua
• curledge and its caller calc_hdmg_tet
• calc_mudua
• campo_effettivo

Most of the software is composed of very small routines that are called with very high frequency, and are thus very difficult to optimize and to measure (in fact they are not even listed in the profiler reports); only the functions noted above have an observable impact on the overall execution time.

4.5 Compiler optimizations

Once again, thanks to the porting operation that has been performed, several compiler optimizations became available and were subsequently added in order to increase the throughput of the program. Most of the additions have been chosen following the official gcc documentation and manual pages.

4.5.1 Native switch

The key to optimization lies in the native machine capabilities: in order to activate at once all the features of a given architecture and of a given processor, it is necessary to set -march=native. In this way all processor-specific instructions can be accessed and all floating point capabilities fully exploited, by selecting the right processor architecture and the available SSE flags. Moreover the floating point instructions are specifically set to use the SSE extensions (-mfpmath=sse, enabled by default).


Figure 16. Call graph scheme of the target software


A similar optimization is also achieved in the Intel FORTRAN Compiler with the -axS -xS switches.

4.5.2 Loop unrolling

Among the loop transformation techniques, loop unrolling has achieved wide success in compiler theory. Its goal is to increase the execution speed of the program at the expense of its size. Loop unrolling is performed by reducing (if not eliminating) the number of “end of loop” tests; in this way the number of jumps and conditional branches decreases and, thanks to the larger size, the number of cache hits increases (in big caches). This optimization is pulled in by the -O3 flag.

4.5.3 IEEE compliance

Due to the highly mathematical nature of the software, the -ffast-math flag has been added: this flag activates a set of optimizations that allow some general speedups by discarding some return codes and by skipping some redundant operations (like ignoring the sign of zero or not considering the NaN and ±Inf number types). The main drawback of this optimization is that it is no longer possible to guarantee compliance with the IEEE, ISO and/or ANSI rules that specify arithmetic compatibility, exceptions and operand order in floating point operations.

4.5.4 Library Striping

One final type of optimization has been inserted at link time. The following options try to decrease the load time of library functions by modifying the executable header (ELF in this context) and symbol handling (9). These options must be passed with the -Wl flag so that the compiler can forward them to the linker. More specifically, the -O1 switch works as follows: as symbols are inserted in the ELF header, they are stored in hash tables; the default configuration keeps the hash keys small, resolving collisions with string comparisons. This optimization shifts the balance towards short hash chains, increasing the hash key length and the header size, but actually reducing the cost of symbol look-ups.

CHAPTER 5

IMPLEMENTATION

5.1 General Scheme

Analyzing the functions of 4.4 over several profiling sessions, a common pattern has been found. As a matter of fact, every function contains one or more loops that carry out a considerable number of instructions over arrays and matrices. For this reason a general plan has been defined, summed up in Figure 17. As a first step, the standard sequential loop is parallelized to fully exploit all the eight cores each single machine can offer. By setting up proper lists of shared/private variables, the loop is divided among a given number of OpenMP threads and each carries out a portion of the iterations; as soon as a thread finishes its portion, a new one is assigned, until the whole loop section is completed. The second step in this strategy is to split the loop into two distinct and equal parts before exploiting OpenMP. Each part is submitted to a node of the cluster and executed separately; at the end of the loop, data is exchanged back with MPI and merged so that the two machines can continue working on complete arrays. Thanks to Infiniband, latency for the exchanged data sets is reduced to a minimum.


Even though OpenMP requires few software modifications, some updates have been carried out in order to obtain the maximum possible throughput from the software, mainly by reducing portions of redundant code.

Figure 17. Implementation scheme overview

It should be noted, however, that the software is not embarrassingly parallel; as a matter of fact a number of modifications to the software were needed in order to apply parallelization and distributed computing. The synchronization mechanism most used is the implicit blocking offered by the send() and recv() calls: since data is exchanged between the two machines in the same manner, neither machine can continue until the other is ready to process data. In other sections of the code, synchronization was achieved with native OpenMP directives, as shown in 5.3.4.

5.2 Hardware Support

The hardware selected for implementing the cluster consists of two computers, each supplied with:
• two quad core Intel Xeon E5420 running at 2.5 GHz frequency, with 6 MB of L2 cache;
• an Intel Server Board S5000PSLSATAR motherboard;
• 32 GB of ECC DDR2 RAM;
• one Infiniband card from Mellanox, model ConnectX IB MHGH28-XTC DDR HCA PCI-e 2.0 x8 Memory Free.

The two machines are connected together with an end-to-end Infiniband link, running at full speed as the cards are mounted on the PCI Express x16 v1.1 slot. The focus in building these computers has been on low-cost components that could enable high performance results.

5.3 Applied Directives

In this section some example code has been extracted from the source of the program and explained.

5.3.1 MPI Layer

The following sections of code show some sample “header” and “epilogue” MPI code that splits the array and merges it back. The header part analyzes the rank variable, which differs for every node of the MPI cluster: inside the if clause the array range is defined by setting the start_INDEX and end_INDEX variables (which represent the beginning and the end of the range). So the first node works on the first half of the array and the second node on the second half, allowing both machines to operate on separate data subsets. Some preprocessor directives have been inserted in order to maintain compatibility with non-MPI systems.

#ifdef MPI_ENABLED
      if (rank .eq. 0) then
         start_INDEX = 1
         end_INDEX = NEDGE/2
      else if (rank .eq. 1) then
         start_INDEX = ( NEDGE/2 ) + 1
         end_INDEX = NEDGE
      endif
#else
      start_INDEX = 1
      end_INDEX = NEDGE
#endif

      DO M=start_INDEX,end_INDEX
      [...]

After the loop has terminated, the array on which the iterations worked must be synchronized on both nodes; this is done with a pair of MPI_SEND and MPI_RECV instructions. The rank variable is checked again to tell which portion of the array must be updated.

#ifdef MPI_ENABLED
      tag = 1
      ISIZE = NEDGE - NEDGE/2
      if (rank .eq. 0) then
         dest = 1
         source = 1
! node 0 receives the second half first, then sends the first half
         call MPI_RECV(BINTMU( (NEDGE/2) + 1), ISIZE, MPI_REAL8,
     &        source, tag, MPI_COMM_WORLD, stat, err)
         call MPI_SEND(BINTMU, NEDGE/2, MPI_REAL8, dest, tag,
     &        MPI_COMM_WORLD, err)
      else if (rank .eq. 1) then
         dest = 0
         source = 0
! node 1 sends the second half first, then receives the first half
         call MPI_SEND(BINTMU( (NEDGE/2) + 1), ISIZE, MPI_REAL8,
     &        dest, tag, MPI_COMM_WORLD, err)
         call MPI_RECV(BINTMU, NEDGE/2, MPI_REAL8, source, tag,
     &        MPI_COMM_WORLD, stat, err)
      endif
#endif
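The index arithmetic of the header and of the epilogue can be verified independently of MPI. The following Python sketch reproduces the same integer divisions; index_range is a hypothetical helper that mirrors start_INDEX and end_INDEX, and the squares stored in the array are arbitrary placeholder data.

```python
def index_range(rank, nedge, mpi_enabled=True):
    """1-based inclusive (start, end) range handled by a rank, as in the header."""
    if not mpi_enabled:
        return 1, nedge
    if rank == 0:
        return 1, nedge // 2
    if rank == 1:
        return nedge // 2 + 1, nedge
    raise ValueError("the scheme covers exactly two ranks")


nedge = 7  # an odd length shows that the second half gets the extra element
isize = nedge - nedge // 2  # length of the second half, as ISIZE in the epilogue
start0, end0 = index_range(0, nedge)
start1, end1 = index_range(1, nedge)

# Each node fills only its own half; the exchange then completes both copies
first_half = [i * i for i in range(start0, end0 + 1)]
second_half = [i * i for i in range(start1, end1 + 1)]
merged = first_half + second_half
print((start0, end0), (start1, end1), isize, merged)
```

The check shows why ISIZE is computed as NEDGE − NEDGE/2 rather than NEDGE/2: with an odd NEDGE the second node handles one more element than the first.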

5.3.2 DO directive

The DO directive is the most common in this configuration. It requires a list of shared and private variables: for the latter, a new memory location is allocated for each thread. The workload is distributed according to the selected scheduler, as described in 3.1.

!$OMP PARALLEL SHARED(IFAEXT,BINTMU,AMAG,TM)
!$OMP& PRIVATE(I,KH,KK,NPOS,IMAG,KCOMP)
!$OMP DO SCHEDULE(GUIDED)
      DO I=start_INDEX,end_INDEX
         [...]
         BINTMU(I)=BINTMU(I)+
     &   IFAEXT(I,KH,2)*TM(NPOS+(IMAG-1)*3+KCOMP)*AMAG(IMAG,KCOMP)
         [...]
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

5.3.3 REDUCTION directive

One of the possible benefits of parallelization comes from a mathematical property of addition and subtraction: since changing the order of the terms does not change the result, the REDUCTION directive allows loop iterations to be executed out of order and the final value to be computed at the end of the loop. Without this directive the target variable could suffer from various synchronization problems, as concurrent reads and writes to a shared location do not guarantee a correct result.

!$OMP PARALLEL PRIVATE (M,K,DOT)
!$OMP& SHARED(H_DEMG,AMAG,VOLTET,NPNMAG)
!$OMP& REDUCTION(+:VOLUME)
!$OMP& REDUCTION(-:DEMG_ENE)
!$OMP DO SCHEDULE(GUIDED)
      DO M=1,NPNMAG
         DOT=0.D0
         DO K=1,3
            DOT=DOT+H_DEMG(M,K)*AMAG(M,K)
         ENDDO
         VOLUME=VOLUME+VOLTET(M)
         DEMG_ENE=DEMG_ENE-VOLTET(M)*DOT/2.D0
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
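The idea behind the REDUCTION clause can be modelled sequentially: every thread accumulates into a private partial result and the partial results are combined once the loop is over. The following Python sketch simulates this behavior; reduce_in_chunks is a hypothetical name and the threads are only simulated by the index of the partial sum.

```python
def reduce_in_chunks(values, n_threads):
    """Sum values by giving each simulated thread a private partial sum."""
    partial = [0.0] * n_threads
    for i, v in enumerate(values):
        partial[i % n_threads] += v  # private accumulation, no shared writes
    return sum(partial)              # combination at the end of the loop


values = [float(i) for i in range(1, 101)]
print(reduce_in_chunks(values, 8), sum(values))
```

For generic floating point data the different summation order may change the last digits of the result, which is an accepted consequence of executing the iterations out of order.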

Unfortunately this option is available for non-array variables only, so it has been applied only a few times.

5.3.4 Avoiding data dependency

One of the main problems of OpenMP, and of parallel programming in general, is data dependency, which is usually resolved by modifying the algorithm structure or by means of synchronization objects. In order to avoid inserting a critical region (corresponding to a CRITICAL or ATOMIC OpenMP directive) around shared constructs, which could have negatively affected performance, an array with self-referencing updates has been converted into a matrix indexed by the number of the working thread; in this way the updates to each element no longer conflict, as only one thread can work on a given row of the matrix at any time.

!$OMP PARALLEL DEFAULT(PRIVATE)
!$OMP& SHARED [...]
#ifdef _OPENMP
      INUM_TH = omp_get_num_threads()
#endif
      [...]
      DO L=1,6
         LATO=(MCNT_E(L,ITET))
         AUS=SIGN(1,LATO)*
     &      ( FMDUA(MATFE(L,1)) - FMDUA(MATFE(L,2)) )
         LATO=ABS(LATO)
#ifdef _OPENMP
         INUM = omp_get_thread_num()+1
#else
         INUM = 1
#endif
! each thread accumulates into its own row of the work matrix
         AMUDUAW(INUM,LATO) = AMUDUAW(INUM,LATO) + AUS
      ENDDO
      [...]
!$OMP END DO
!$OMP END PARALLEL

At the end of the operation, the original array is rebuilt with a simple loop over the number of generated threads (stored in INUM_TH).

      DO ILATO=1, NEDGE
#ifdef _OPENMP
         DO III=1, INUM_TH
#else
         III=1
#endif
            AMUDUA(ILATO) = AMUDUA(ILATO) + AMUDUAW(III,ILATO)
#ifdef _OPENMP
         ENDDO
#endif
      ENDDO
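The same technique can be outlined outside FORTRAN. In the following Python sketch the work matrix plays the role of AMUDUAW and the final list that of AMUDUA; the function name, the 0-based indices and the sample updates are assumptions made for this example, and the threads are simulated by the first element of each update.

```python
def accumulate_per_thread(updates, n_threads, n_edges):
    """Accumulate (thread_id, edge, value) updates without shared writes."""
    # one row per thread: the analogue of the work matrix
    work = [[0.0] * n_edges for _ in range(n_threads)]
    for thread_id, edge, value in updates:
        work[thread_id][edge] += value  # a thread only touches its own row
    # rebuild the original array by summing over the thread index
    return [sum(work[t][e] for t in range(n_threads)) for e in range(n_edges)]


updates = [(0, 0, 1.0), (1, 0, 2.0), (1, 2, -1.0), (0, 2, 4.0), (2, 1, 0.5)]
print(accumulate_per_thread(updates, n_threads=3, n_edges=3))
```

The price of avoiding the critical region is memory: the work matrix is as many times larger than the original array as there are threads.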

5.4 Results

5.4.1 Reduced test case

During development the test case was run to understand whether the current implementation was providing good results. The simulation had a duration of only 8 ps and was composed of just 1000 elements (see Figure 14), but it was already possible to notice some good improvements to the software. Further work has been done after these results were produced. The following table (Table III) summarizes the total execution time in seconds; in the table the label OMP stands for OpenMP, MPI for OpenMPI over Infiniband and OPT for optimizations, while in each field a * stands for enabled and a - for disabled. The execution time has been reduced by 87.5% (from 1062 s to 133 s) going from the old configuration to the newer optimized environment with MPI over Infiniband plus OpenMP. Not surprisingly, the most effective contribution comes from the optimizations: this is because the ability to access all the SSE extensions, together with loop unrolling (see 4.5), already adds some SIMD execution to the software.


TABLE III

PARTIAL RESULTS

OMP   MPI   OPT   seconds
 *     *     *        133
 *     *     -        400
 *     -     *        186
 *     -     -        487
 -     *     *        200
 -     *     -        792
 -     -     *        246
 -     -     -       1062

However it is important to take into consideration the targets of this project. It is true that the most cumbersome code for the processor has been duly parallelized, but the software is composed of a high number of other functions that are either strictly serial or of very short duration. The sections that have been parallelized and distributed have received a speed boost, but the final software performance suffers from the presence of serial code and from the high number of small functions. This also explains why the optimizations bring such an improvement, as they affect all the software without distinction. So a more sensible comparison can only be made if the optimization element is kept constant.

5.4.2 Final test case

With the analysis of the previous data, it was possible to understand what really needed to be measured and improved, so development continued focusing on the new ratio. In the end, when all the most computationally expensive functions had been addressed, it was possible to launch the final test case with the same characteristics as before and to obtain the following results:

TABLE IV

FINAL RESULTS

OMP   MPI   seconds
 *     *         59
 *     -        129
 -     *        174
 -     -        249

The total speed improvement of the OpenMP and MPI elements alone corresponds to a 76% reduction in execution time (from 249 s to 59 s). This is a very good result, because not only is it comparable to the speedup introduced by the optimizations, but it also outdoes the results obtained with the Intel FORTRAN compiler v10 (obtained through other tests) by roughly 23%. Looking at the contribution of the single functions in more detail, it is possible to see the effect of OpenMP and MPI over Infiniband with no overhead from the other routines. The following table (Table V) shows the actual impact of the technologies used to increase the throughput of the software.


TABLE V

FUNCTION RESULTS

Function Name      Normal    OpenMP   MPI      OpenMP+MPI
calc_intmudua      24.5 s    4.7 s    14.4 s   2.8 s
calc_hdmg_tet      16.9 s    3.0 s    10.8 s   1.7 s
calc_mudua         12.1 s    1.9 s    7.0 s    1.1 s
campo_effettivo    17.7 s    4.5 s    9.9 s    2.3 s

Looking at the OpenMP column, there is an aggressive reduction, by a factor of roughly 4 to 6 according to Table V: this is a very good result, as it means that the code was able to exploit the available processors to a large extent, with very little overhead and no synchronization problems. As for MPI, on the other hand, the speed improvement approaches a factor of 2 (about 1.6 to 1.8 in Table V); this is sensible, as the code was almost split in two, so it is normal that the execution time is reduced to about half. It is interesting to notice that this effect carries over when MPI is combined with OpenMP. As a matter of fact, thanks to the Infiniband channel used, communication time is negligible, and so only the small MPI overhead can influence execution.
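The speedup factors discussed here can be recomputed directly from Table V. The following Python sketch divides the Normal time of each function by the time of every other configuration; the dictionary layout and the function name speedups are illustrative choices, while the timings are copied from the table.

```python
TIMES = {  # seconds, copied from Table V
    "calc_intmudua":   {"Normal": 24.5, "OpenMP": 4.7, "MPI": 14.4, "OpenMP+MPI": 2.8},
    "calc_hdmg_tet":   {"Normal": 16.9, "OpenMP": 3.0, "MPI": 10.8, "OpenMP+MPI": 1.7},
    "calc_mudua":      {"Normal": 12.1, "OpenMP": 1.9, "MPI": 7.0, "OpenMP+MPI": 1.1},
    "campo_effettivo": {"Normal": 17.7, "OpenMP": 4.5, "MPI": 9.9, "OpenMP+MPI": 2.3},
}


def speedups(times):
    """Speedup of every configuration with respect to the Normal column."""
    return {
        name: {cfg: row["Normal"] / t for cfg, t in row.items() if cfg != "Normal"}
        for name, row in times.items()
    }


for name, row in speedups(TIMES).items():
    print(name, {cfg: round(s, 2) for cfg, s in row.items()})
```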

CHAPTER 6

CONCLUSION

In this thesis, it has been demonstrated that, to achieve the best results, a complete review of the software must be taken into account. Highly serialized software, written around a mathematical model, must be reorganized to allow better parallelization. However, there are technologies that can have a direct impact on performance, in particular OpenMP and MPI. With very little software modification and simple code analysis, it has been possible to introduce a significant improvement in the overall execution time. Furthermore, the standard, clean and stable environment of the GCC suite gave access to important optimization controls that increased the quality of the software where this had not been done with OpenMP or MPI.

Even so, this project shows significant room for improvement. First of all, algorithm optimizations are necessary to obtain high performance; secondly, it could be possible to take advantage of FORTRAN library functions for otherwise long routines, all the more so given the high number of small operations that are repeated several times. Thirdly, software analysis must continue in order to extract precise timing information from profiling and to identify the other computationally expensive functions that could receive a significant improvement from the inclusion of OpenMP and MPI directives.

Finally, thanks to the high scalability of cluster systems, it should be fairly easy and very convenient to add new elements that can contribute to the computation; in fact, it would be possible to connect more components to the cluster using an Infiniband switch, at the sole cost of some increased latency. Moreover, since the middleware is based on open standards, OpenMP and MPI, porting the software to other architectures and expanding its routines to use further nodes of the cluster should not be considered a complex task.

APPENDICES


Appendix A

STREAMING SIMD EXTENSIONS

Introduced by Intel in its line of Pentium III processors, the Streaming SIMD Extensions (SSE) allow for SIMD (single instruction, multiple data) execution. While older processors could only process one data element per instruction, SIMD technology allows instructions to handle multiple data elements, making processing much quicker. Applications such as 3D graphics benefit greatly from SSE thanks to the availability of extended floating point registers. In contrast to the preceding MMX technology, SSE registers have an increased width, allowing more bits to be stored and processed by each instruction.

Initially, eight new 128-bit registers known as XMM0 through XMM7 were added. SSE2 adds new mathematical instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on the 128-bit XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers “aliased” on the original floating point register stack. SSE integer instructions introduced with later extensions would still operate on 64-bit MMX registers because the new XMM registers require operating system support (this behavior changed only with SSE4 onward). SSE2 enables the programmer to perform SIMD math of virtually any type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to touch the obsolete MMX/FPU registers.


SSE3, SSSE3 and SSE4 are further revisions to the architecture and introduce new operating conditions (column access to registers), new instructions (that can act on 64-bit MMX or 128-bit XMM registers and simplify the implementation of DSP and 3D code) and conversion utilities that avoid pipeline stalls.

In a multi-tasking environment, the Streaming SIMD Extensions require support from the operating system: the SIMD registers must be handled properly by the operating system’s context switching code. When the system switches control from one process to another, the old process’s SIMD registers must be saved away, and the saved values of the new process’s SIMD registers must be loaded into the processor. The Pentium III processor prohibits programs from using the Streaming SIMD Extensions unless the operating system tells the processor at system startup time that it is aware of the SIMD registers and will manage them properly.
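As an illustration of handling several data elements per instruction, the following sketch adds two arrays of single-precision values four elements at a time, using the SSE intrinsics declared in xmmintrin.h. The function name add_floats is hypothetical, and the scalar fallback is included only so that the sketch also builds on targets where the compiler does not define __SSE__.

```c
#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Stores x[i] + y[i] into out[i] for every i in [0, n). */
void add_floats(const float *x, const float *y, float *out, size_t n)
{
    size_t i = 0;
#ifdef __SSE__
    /* Each 128-bit XMM register holds four 32-bit floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar loop: handles the tail, or everything when SSE is unavailable. */
    for (; i < n; i++)
        out[i] = x[i] + y[i];
}
```

The unaligned load and store variants are used so that the caller does not have to guarantee 16-byte alignment of the arrays; aligned accesses would be a possible refinement.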


Appendix B

OPENMP TEST PROGRAM

The test program has been designed to simulate some computationally intensive routines of the target software; in the main loop many mathematical functions are executed over a set of arrays, without creating data dependencies between the iterations. Statistics are printed at the beginning and at the end of the program. The function gettimeofday() is used to obtain the total execution time, and the loop is repeated ten times to obtain a more reliable average.

for (u = 1; u <= 32; u++) {
    /* repeat the measurement with 1 to 32 threads */
    printf("%d threads\t", u);
    omp_set_num_threads(u);
    totaltime = 0;

    for (t = 0; t < 10; t++) {
        gettimeofday(&timing_start, NULL);
        /* iterations are independent: only the loop index is private */
        #pragma omp parallel for \
            shared (a, b, c, d, e, f, g, h, chunk) private (i) \
            schedule (guided, chunk)
        for (i = 0; i < N; i++) {
            c[i] = pow(sqrt(a[i] * b[i] / (b[i] + a[i])), 3);
            d[i] = sqrt(c[i] * b[i] / (c[i] + pow(a[i], 4)));
            e[i] = pow(pow(c[i], d[i]), pow(d[i], c[i]));
            f[i] = sqrt(pow(a[i], c[i] + b[i]));
            h[i] = a[i] * b[i] * c[i] * d[i] * e[i] * f[i] * g[i] * h[i];
        }
        gettimeofday(&timing_end, NULL);
        /* elapsed time of this repetition, in microseconds */
        totaltime += (timing_end.tv_sec - timing_start.tv_sec) * 1000000
                   + (timing_end.tv_usec - timing_start.tv_usec);
    }

    /* average over the ten repetitions */
    printf("%d\n", totaltime / 10);
}


Appendix C

OPENMP FORTRAN REFERENCE

The general OpenMP directive begins with !$OMP, the sentinel that marks the start of an OpenMP construct; a directive that encloses a block of code has to be declared with an opening and a closing line, such as:

!$OMP directive [clause ...]
[...]
!$OMP END directive

The first directive must be PARALLEL, which wraps the code section that must be executed in parallel, and it is closed by the corresponding END PARALLEL. The clauses accepted by this directive may be any combination of the following:

IF (condition) parallel execution is activated only when condition is met;
PRIVATE (list) list of private variables;
SHARED (list) list of shared variables;
DEFAULT (type) type of visibility for variables not listed before;
FIRSTPRIVATE (list) list of private variables that are automatically initialized with the value the original variable had before the construct;
REDUCTION (operator: list) performs a reduction of kind operator on the variables in list, combining the private copies of the threads in no particular order;
COPYIN (list) copies the master thread's values of the THREADPRIVATE variables in list to the other threads;


NUM_THREADS (num) statically sets the number of threads to generate.

Another important OpenMP directive is DO, which specifies that the next loop can be executed in parallel by the thread team. Its clauses are:

SCHEDULE (type [, chunk]) describes how the iterations of the loop are divided (in chunks) among the threads;
ORDERED lets the loop contain ORDERED regions, which are executed in sequential iteration order;
PRIVATE (list) list of private variables;
FIRSTPRIVATE (list) list of private variables that are automatically initialized with the value the original variable had before the construct;
LASTPRIVATE (list) list of private variables whose value from the last iteration is copied back to the original variable when the loop ends;
SHARED (list) list of shared variables;
REDUCTION (operator | intrinsic : list) performs a reduction of kind operator (or intrinsic function) on the variables in list, combining the private copies of the threads in no particular order;
COLLAPSE (n) collapses the n nested loops that follow into a single iteration space.

Other parallelizing directives that don’t require any particular clause configuration are:

• SECTIONS: statically splits the code into sections, each of which is assigned to a single thread in the pool;
• WORKSHARE: divides the execution of the enclosed code block into separate units of work;
• TASK: defines an explicit task, which may be executed by the encountering thread, or deferred for execution by any other thread in the team.
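The clauses listed above combine as in the following sketch, written in C to match the test program of Appendix B (in FORTRAN the same clause names appear between !$OMP PARALLEL DO and !$OMP END PARALLEL DO). The function dot_product and its arguments are illustrative only; when the compiler is run without OpenMP support the pragma is ignored and the loop simply executes serially.

```c
/* Dot product of two vectors using a parallel loop with a sum reduction. */
double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    /* x, y and n are shared by the team, i is private to each thread.
       Every thread accumulates into its own copy of sum; the copies are
       added together when the loop ends. */
    #pragma omp parallel for shared(x, y, n) private(i) \
            reduction(+:sum) schedule(static)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Because the partial sums are combined in no particular order, floating point results may differ in the last digits from one run to the next, which is the usual price of a reduction.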


As for synchronization management, the following directives are available, which likewise need no additional clause:

• MASTER: specifies a region of code that is executed only by the master thread;
• CRITICAL: identifies a critical region that only one thread at a time can execute;
• BARRIER: implements a barrier where execution is stopped until all threads are ready to continue;
• ATOMIC: defines a single-instruction critical region, in which memory is accessed atomically by all the threads.
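The difference between ATOMIC and CRITICAL can be seen in a small C sketch (the FORTRAN directives behave in the same way). The function scan_values is a hypothetical example that counts the values above a threshold and finds the largest one; it assumes that the array holds at least one element.

```c
/* Counts the values above threshold and finds the largest value in v.
   Assumes n >= 1. */
void scan_values(const double *v, int n, double threshold,
                 int *count_above, double *largest)
{
    int count = 0;
    double max = v[0];
    int i;

    #pragma omp parallel for shared(v, n, threshold, count, max) private(i)
    for (i = 0; i < n; i++) {
        if (v[i] > threshold) {
            /* ATOMIC: a single memory update, performed indivisibly. */
            #pragma omp atomic
            count++;
        }
        /* CRITICAL: only one thread at a time runs the compare-and-store. */
        #pragma omp critical
        {
            if (v[i] > max)
                max = v[i];
        }
    }

    *count_above = count;
    *largest = max;
}
```

In production code a REDUCTION clause would usually be faster for both results; the explicit synchronization is kept here only to show the two directives side by side.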

Finally, it is possible to use some OpenMP related functions to further adapt the software to a multiprogrammed system; this set of routines may be used for a variety of applications, such as obtaining information from single threads, configuring the number of threads, getting environment data (like the number of processors), locking variables, timing and so on. For example:

• OMP_SET_NUM_THREADS(): sets the number of threads that must be started;
• OMP_GET_NUM_THREADS(): returns the number of threads in the current parallel region;
• OMP_GET_THREAD_NUM(): returns the number identifying the calling thread in the pool;
• OMP_GET_THREAD_LIMIT(): returns the maximum number of OpenMP threads available to a program;
• OMP_GET_NUM_PROCS(): returns the number of processors that are available to the program;

• OMP_INIT_LOCK(): initializes a lock on the variable, setting the lock to “unset”;
• OMP_DESTROY_LOCK(): eliminates the lock on the variable;
• OMP_SET_LOCK(): sets the lock on the given variable, waiting until it becomes available;
• OMP_UNSET_LOCK(): unsets the lock on the given variable;
• OMP_TEST_LOCK(): attempts to set the lock on the given variable without waiting, and reports whether it succeeded.
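The runtime routines above have direct C counterparts with lowercase names, declared in omp.h. The sketch below uses a few of them together with a lock; the function count_team is a hypothetical example, and the serial branch exists only so that the code also builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Runs a parallel region with the requested number of threads (which must
   be positive). Returns the size of the team that was actually created and
   stores in *visits how many threads passed through the locked section. */
int count_team(int requested_threads, int *visits)
{
    int team_size = 1;
    int counter = 0;

#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);                 /* the lock starts out "unset" */
    omp_set_num_threads(requested_threads);

    #pragma omp parallel shared(team_size, counter, lock)
    {
        if (omp_get_thread_num() == 0)    /* thread 0 is the master */
            team_size = omp_get_num_threads();

        omp_set_lock(&lock);              /* wait until the lock is free */
        counter++;                        /* one visit per thread */
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
#else
    (void)requested_threads;              /* serial build: a single thread */
    counter = 1;
#endif

    *visits = counter;
    return team_size;
}
```

The runtime may grant fewer threads than requested, so the caller should rely on the returned team size rather than on the value it asked for.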


Appendix D

MPI FORTRAN REFERENCE

MPI routines are added to a standard FORTRAN program by including the mpif.h header file. After this, the MPI layer must be initialized with MPI_INIT() before using any MPI related function, and it must be closed with MPI_FINALIZE() before ending the program. By using:

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

the program becomes aware of running in an MPI environment, as these routines save the number of the current instance of the program in the rank variable and the total number of instances created in numtasks, respectively.

It is then possible to use the point-to-point communication routines, which are present in many variants (blocking, synchronous, non-blocking, buffered) and are all modeled on the following:

call MPI_SEND(buffer, quantity, type, destination, tag, MPI_COMM_WORLD, ierr)
call MPI_RECV(buffer, quantity, type, source, tag, MPI_COMM_WORLD, stat, ierr)

The arguments of the equivalent routines are similar:

buffer : represents either the data that has to be sent or the memory location in which it must be saved;


quantity : tells how many elements of type are sent in the message;
type : sets one of the MPI data types for the transfer;
destination/source : gives the number (rank) of the instance of the program that has to receive or send the buffer;
tag : identifies the message that must be sent or received;
MPI_COMM_WORLD : is the predefined communicator that includes all the instances of the program;
stat : represents the status of the transfer;
ierr : is the error variable, set in case communication fails.
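Once rank and numtasks are known, each instance usually derives from them the portion of the work it is responsible for. The sketch below shows one common way of doing this, an even block partition, in plain C and without any MPI call; block_range is a hypothetical helper, and the partitioning scheme is an assumption, since the way the thesis code divides its loops is not shown here.

```c
/* Computes the half-open index range [*first, *last) handled by the
   instance with the given rank, when n items are divided as evenly as
   possible among numtasks instances. */
void block_range(int rank, int numtasks, int n, int *first, int *last)
{
    int base  = n / numtasks;   /* items that every instance receives */
    int extra = n % numtasks;   /* leftovers: one more for the lowest ranks */

    *first = rank * base + (rank < extra ? rank : extra);
    *last  = *first + base + (rank < extra ? 1 : 0);
}
```

With two instances and ten items, rank 0 obtains the range [0, 5) and rank 1 the range [5, 10); the partial results would then be exchanged with MPI_SEND and MPI_RECV as described above.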

MPI also allows for collective communication (a sort of “multicasting”) by means of functions such as:

• MPI_BCAST(): a message is sent to all the nodes;
• MPI_SCATTER(): a message is split and sent to all the nodes;
• MPI_GATHER(): a message is received from all the nodes;
• MPI_ALLTOALL(): each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group.

These functions require information about the data buffers of both the sender and the receiver.

In order to compile an MPI-enabled program, it is not possible to call the compiler directly; it is necessary to resort to the wrapper of the MPI distribution, which correctly sets paths and libraries. A special wrapper, with its own syntax, must also be used for launching executables.
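The collective routines differ mainly in how the send and receive buffers map onto each other. The sketch below models that mapping for scatter and gather on ordinary arrays inside a single process; it makes no MPI calls, and model_scatter and model_gather are hypothetical names used only for illustration.

```c
#include <string.h>

/* Scatter: the root's send buffer is cut into numtasks consecutive pieces
   of count ints, and piece r ends up in the receive buffer of rank r.
   recv[r] points to the receive buffer of rank r. */
void model_scatter(const int *send, int count, int numtasks, int **recv)
{
    for (int r = 0; r < numtasks; r++)
        memcpy(recv[r], send + r * count, (size_t)count * sizeof(int));
}

/* Gather: the inverse mapping. The piece sent by rank r is stored at
   offset r * count of the root's receive buffer. */
void model_gather(int *const *send, int count, int numtasks, int *recv)
{
    for (int r = 0; r < numtasks; r++)
        memcpy(recv + r * count, send[r], (size_t)count * sizeof(int));
}
```

In a real program the per-rank buffers live in different processes and the copies travel over the interconnect, but the index arithmetic is the same, which is why both the send and the receive buffer have to be described in the call.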


In the case of Open MPI, the MPI implementation selected for this project, the compiler wrapper is called mpif90 while the launching wrapper is mpirun; the latter must be called specifying the number of instances of the program to run (-np) and the list of hosts that have to execute it (-host). So, for example, in a two-machine cluster environment in which each node has to execute one instance of the program, the correct syntax is

$ mpirun -np 2 -host host1,host2 program [args]

It is possible to share some environment variables among the nodes with the -x switch; this is required for an OpenMP+MPI system, as the number of threads depends on the value of OMP_NUM_THREADS. The resulting command line instruction is:

$ mpirun -np 2 -host host1,host2 -x OMP_NUM_THREADS program [args]


VITA

NAME: Vittorio Giovara

EDUCATION:
B.Sc. equiv., Computer Engineering, Politecnico di Torino, Turin, Italy, 2007
M.Sc. equiv., Computer Engineering, Politecnico di Torino, Turin, Italy, 2009, under the advising of professors Bartolomeo Montrucchio and Carlo Ragusa
Master of Science in Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, Illinois, 2009, under the advising of professors Bartolomeo Montrucchio and Zhichun Zhu

HONORS:
PROFICIENCY Certificate in English, Cambridge University, Turin, Italy, 2004
BTP certification, XX Winter Olympics, TOBO, Turin, Italy, 2006
TOP-UIC Fellowship, Politecnico di Torino, Turin, Italy, 2008

PROFESSIONAL:
Project manager for GLE-MiPS, a VHDL description for processor architecture, focusing on the educational implementation, http://gle-mips.googlecode.com
Developer of Hedgewars, a strategy game, managing the Mac OS X and iPhone versions, http://www.hedgewars.org
Editor and author for ProjectSymphony, a collection of academic essays and homework reports publicly available, http://www.scribd.com/ProjectSymphony
