
Parallel and Distributed Programming on Low Latency Clusters


B.Sc. (Politecnico di Torino) 2007


Submitted as a partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Chicago, 2010

Chicago, Illinois

To my mother,

without whose continuous love

and support I would never have made it.



I want to thank all my family, my mother Silvana, my grandmother Nenna and my dear

Tanino who help me with love and support every day of my life.

Then I would like to thank all the faculty members that assisted me with this project, in

particular professor Bartolomeo Montrucchio and professor Carlo Ragusa for all the time spent

with me trying to make the software run, and researcher Fabio Freschi for giving me useful

suggestions during development.

Finally I would like to thank all my friends that were near me during these years, Alberto

Grand, whose patience and kindness towards me are really extraordinary, and Salvatore

Campione, who is an encouraging model for my studies.

V. G.




1 INTRODUCTION
1.1 Evolution of parallel and distributed systems
1.2 Computer architecture classification
1.3 Thesis Contents

2 BACKGROUND
2.1 Parallel and distributed application developing
2.2 Technological requirements
2.2.1 SMP processors
2.2.2 Multithreading
2.2.3 GPGPU computing
2.2.4 NUMA machines
2.2.5 Clusters
2.3 Scientific software advance

3 TECHNOLOGY
3.1 Parallel applications with OpenMP
3.1.1 Amdahl’s Law
3.1.2 Benchmarking
    Sequential program with OpenMP enhancements
    OpenMP schedulers performance
        Static Scheduler
        Dynamic Scheduler
        Guided Scheduler
    OpenMP enhancement results
3.2 Infiniband
3.3 Distributed execution with MPI
3.3.1 MPI over Infiniband
3.3.2 Benchmarks
    Single message over Infiniband with MPI
    Multiple messages over Infiniband with MPI
    Latency

4 ALGORITHM
4.1 Overview
4.2 Code Flowchart
4.3 Test Case


Sections of chapters 4 and 5:
    Results
    Reduced test case
    Final test case
    Profiling
    General Scheme
    MPI Layer
    Hardware Support
    Applied Directives
    DO directive
    REDUCTION directive
    Avoiding data dependency
    Compiler optimizations
    Native switch
    Loop unrolling
    IEEE compliance
    Library Striping

5 IMPLEMENTATION

6 CONCLUSION

APPENDICES
    Appendix A
    Appendix B
    Appendix C
    Appendix D

CITED LITERATURE

VITA

LIST OF TABLES

I MAXIMUM DATA THROUGHPUT IN DIFFERENT CONFIGURATIONS
II MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY
III PARTIAL RESULTS
IV FINAL RESULTS
V FUNCTION RESULTS

LIST OF FIGURES

1 Approach levels for parallelization
2 Classification scheme of computer architecture classification
3 Image showing the tree splitting procedure of a sequential task
4 Graph plotting of the theoretical curve from Amdahl’s Law
5 Graph plotting of Amdahl’s Law for multiprocessors
6 Performance overview of an OpenMP threaded program
7 OpenMP static scheduler performance chart
8 OpenMP dynamic scheduler performance chart
9 OpenMP guided scheduler performance chart
10 OpenMP scheduler overview
11 Time v. size for a single message
12 Time v. size for 1024 consecutive messages
13 Flowchart of the main functions implemented in the code
14 Standard problem #4 representation
15 S state field representation
16 Call graph scheme of the target software
17 Implementation scheme overview

LIST OF ABBREVIATIONS

API       Application Programming Interface
SMP       Symmetric Multi-Processing
OpenMP    Open Multi-Processing
MPI       Message Passing Interface
IPC       Inter Process Communication
PML       Point-to-point Messaging Layer
BTL       Byte Transfer Layer
SISD      Single Instruction Single Data
SIMD      Single Instruction Multiple Data
MISD      Multiple Instructions Single Data
MIMD      Multiple Instructions Multiple Data
SPMD      Single Program Multiple Data
MPMD      Multiple Program Multiple Data
SSE       Streaming SIMD Extensions
SSSE3     Supplemental Streaming SIMD Extensions 3
UMA       Uniform Memory Access
NUMA      Non-Uniform Memory Access
GPU       Graphics Processing Unit
GPGPU     General-Purpose computing on Graphics Processing Units
ECC       Error-Correcting Code
LLG       Landau-Lifshitz-Gilbert equation

SUMMARY

The goal of this thesis is to increase the performance and data throughput of Sally3D, an electromagnetic field analyzer and micromagnetic modeler for nanomagnets, developed at “Politecnico di Torino” by the Electrical Engineering department.

This target has been achieved by means of open standards, such as OpenMP and MPI, which offer a robust parallel programming paradigm and an efficient message passing API.

The hardware used consists of two computers, each with two quad-core Intel Xeon processors running at 2.5 GHz, 32 GB of RAM and a 20 Gb/s Infiniband network card. In order to reduce latency in message passing between the two machines, a point-to-point Infiniband link has been implemented.

The results show that it is possible to achieve an 80% speed improvement thanks to optimized code, OpenMP multithreading and MPI communication.

CHAPTER 1

INTRODUCTION

1.1 Evolution of parallel and distributed systems

Until some decades ago computer applications were written in a sequential style, in which the instructions were executed in a fixed order; the programs relied on a single processing unit and the throughput depended on the processor speed. Nowadays, however, the technological trend is to control processor frequency and voltage in order to consume less power and generate less heat, and on such architectures sequential programming is no longer effective. For this reason a new execution paradigm has been exploited: parallel programming.

Parallel computing is the simultaneous execution of operations at different levels. The most widely used forms of parallelism are bit-level, which increases the bit size of words; instruction-level, which exploits instruction pipelines in processor architectures; loop-level, which distributes the data-independent instructions of a loop among different cores; and task-level, which distributes complete threads among the cores.

In order to be able to use parallel applications, hardware support must be present. There are many kinds of parallel-oriented computers: multi-core, a single processor with many processing units; symmetric multiprocessing, a machine with more than one (multicore) processor; cluster and grid computing, closely coupled computers connected with high-end networks; and finally graphics processing units, which are used for general purpose computation and are suited for linear and array operations.

On the other hand, parallel applications bring some drawbacks at different levels: manually programming threads and concurrent processes is a difficult task, as data dependency must be carefully handled, and poor programming styles may lead to performance degradation. Moreover, a parallel environment introduces several new problems, such as deadlock or starvation, in which execution cannot continue due to resource dependency conflicts.

Subsequently there has been an increasing research effort to circumvent the difficulties of parallel programming by having the compiler parallelize code automatically, exploiting the full capabilities of multicore processors. However, complete automatic parallelization is a very complex operation requiring computational power that has not yet been reached; for this reason several other approaches have been proposed.

A quite simple and somewhat effective technology is loop unrolling, activated by proper compiler optimizations: instead of translating a loop into a sequence of operations followed by a jump, the cycle is transformed into a completely sequential program, preventing a lot of jumps and processor flushes. This is quite beneficial for pipelined processors, which present a high overhead for jump operations, but the code size increases in proportion to the dimension of the loop and there is still exponential complexity in unrolling very large cycles.

A more effective way was introduced a few years ago, in which programmers insert hints as compiler directives: in this way it is possible to define sections of code that can be safely parallelized. The interaction level in this methodology is more advanced with respect to loop unrolling, as it requires deeper knowledge of the program and of the dependencies between variables; however, even a limited insertion of compiler directives has a major effect on parallelization and program throughput.

As may seem obvious, full parallelization is achieved when it is set up as a goal during program design, but it is possible to adapt the project during development at different stages, each requiring an action of different difficulty. The next figure (Figure 1) shows the different parallelization methodologies and the depth of the approach each one requires.

Figure 1. Approach levels for parallelization
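As a rough illustration of the loop unrolling transformation described in this section, the sketch below sums a list with a plain loop and with a hand-unrolled loop that handles four elements per iteration. Python is used here only to show the shape of the transformation; a compiler applies it to machine code. The function names and the unroll factor of 4 are assumptions made for this example.

```python
def sum_rolled(values):
    """Plain loop: one addition and one loop test/jump per element."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values):
    """Same sum, unrolled by 4: one loop test/jump per four elements."""
    total = 0
    n = len(values)
    limit = n - (n % 4)      # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop for the last n % 4 elements.
    while i < n:
        total += values[i]
        i += 1
    return total


if __name__ == "__main__":
    data = list(range(10))
    print(sum_rolled(data), sum_unrolled_by_4(data))
```

The unrolled body is roughly four times longer than the original one, which is the code size cost mentioned above; the remainder loop is needed whenever the iteration count is not a multiple of the unroll factor.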

1.2 Computer architecture classification

As soon as parallel computation theory began to gain popularity, there was a shift in computer architecture design and a precise classification was needed. Starting from a single processor model that operates on a single data stream, it became possible to consider multiple or single instructions operating on multiple or single data; the resulting classes are gathered in Flynn’s taxonomy:

SISD computers are traditional machines with a single processor operating on a single instruction (or data) stream, often stored in a single memory. This is the oldest architecture design and was the leading model in computer markets until a decade ago, when the first MMX extension was added to Intel processors.

SIMD is the general modern architecture commonly found in current processors in the form of SSE, Altivec and VIS (Visual Instruction Set, a technology present in SPARC processors) instructions, among others. Multimedia operations are the prime beneficiaries of this design, as well as cryptography and data compression; most recently GPUs have started to exceed this paradigm with emphasis on vectorial parallelization.

MISD architecture is an uncommon one, as there is no performance benefit from this design, but it is often found in mission critical applications, in which a dependable system must be developed. As a matter of fact, operating on single data with multiple identical instructions allows error detection and error correction by means of hardware and time redundancy.

MIMD systems are suited for computer clusters in which a shared or distributed memory is used; processors may function asynchronously and independently. Parallelism is achieved because at any time the computers may be executing different instructions on different data.

Figure 2. Classification scheme of computer architecture classification

The MIMD class can be classified further by extending the concept of “instruction” to the notion of “program”:

SPMD: multiple processors execute the same program at the same time, but at independent points in the code and while working on different data.

MPMD: an implementation of a client/server model in which a master feeds the other nodes with data and coordinates the workload distribution, so each node executes a different set of programs on different data and reports its result to the master.

1.3 Thesis Contents

This thesis describes how to apply such levels of parallelization to a completely serial numerical program, in order to increase its computational performance in a distributed and parallel environment. For this reason a MIMD system will be exploited.

The program consists of an equation solver written in FORTRAN, suited for the computation of electromagnetic field analysis, with high level plotter resolution. Since the program is already provided, it is not possible to abstract to a very high level methodology; for this reason the parallelization technology selected is OpenMP, which offers a set of compiler directives to extend sequential sections of code to every core of the machine.

As for the distributed part of the algorithm, two technologies have been adopted: MPI and Infiniband. MPI is a high level API for performing Inter Process Communication on the same machine or on different nodes, available for many different programming languages (even for those which do not have IPC mechanism capabilities). Infiniband, on the other hand, was chosen for its outstanding performance in sending small quantities of data with very little latency.

After the introduction, this document presents a general background and previous work regarding parallel application methodologies, followed by a thorough description of the technologies used in this research. Then the main algorithm of the program is outlined, showing the critical points in which a possible performance increase may be achieved through parallelization or distribution. Finally some results are presented, tracing the throughput growth of the program with OpenMP and MPI directives.
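The SPMD model mentioned in Section 1.2 can be sketched in a few lines: every "rank" runs the same function and uses only its own rank number to decide which slice of the data it owns. This is a minimal sketch under stated assumptions, not the thesis code: the ranks are run one after another inside a single Python process, whereas in an MPI program each rank would be a separate process, and the names `block_range`, `spmd_body` and `run_spmd` are hypothetical.

```python
def block_range(rank, size, n):
    """Return the half-open index range [start, stop) owned by `rank`.

    The n items are split into `size` contiguous blocks whose lengths
    differ by at most one.
    """
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def spmd_body(rank, size, data):
    """The single program that every rank runs; only `rank` differs."""
    start, stop = block_range(rank, size, len(data))
    return sum(x * x for x in data[start:stop])


def run_spmd(size, data):
    """Stand-in for launching `size` processes and reducing their results."""
    partials = [spmd_body(rank, size, data) for rank in range(size)]
    return sum(partials)


if __name__ == "__main__":
    print(run_spmd(4, list(range(10))))
```

In an MPMD arrangement the master would instead run a different routine from the workers, handing out the ranges and collecting the partial results.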

CHAPTER 2

BACKGROUND

2.1 Parallel and distributed application developing

Historically, parallel and distributed computing has been considered to be “the high end of computing”, and has been used to model difficult scientific and engineering problems found in the real world. Some examples (source: Livermore Computing Center):

• Atmosphere, Earth, Environment.
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics.
• Bioscience, Biotechnology, Genetics.
• Chemistry, Molecular Sciences.
• Geology, Seismology.
• Mechanical Engineering - from prosthetics to spacecraft.
• Electrical Engineering, Circuit Design, Microelectronics.
• Computer Science, Mathematics.
• Processing of large amounts of data in sophisticated ways, such as:
  – Databases, data mining.
  – Oil exploration.
  – Web search engines, web based business services.

  – Medical imaging and diagnosis.
  – Pharmaceutical design.
  – Management of national and multi-national corporations.
  – Financial and economic modeling.
  – Advanced graphics and virtual reality, particularly in the entertainment industry.
  – Networked video and multi-media technologies.
  – Collaborative work environments.

2.2 Technological requirements

2.2.1 SMP processors

As the demand for performance increases and the cost of microprocessors continues to drop, the single processor model has been abandoned in favor of an SMP organization. An SMP architecture refers to a computer system composed of multiple processors connected to a single shared memory and to a shared I/O controller. Operating system support is necessary for enabling this feature. Moreover, programs have to be rewritten, or at least reconsidered, in order to access every available resource. For this reason there has been a continuous improvement in compiler software, trying to simplify program parallelization for developers.

Resorting to an SMP architecture can bring many advantages (1):

1. Performance – workload can be spread among more processors, running different tasks in parallel.
2. Availability – it is possible to set up the system to execute the same instruction on all the symmetric processors, so that hardware failures can be tolerated (a sort of MISD architecture).
3. Incremental Growth – adding processors increases performance even more, up to a certain extent.
4. Scaling – vendors can offer more systems with different SMP configurations.
5. Transparency – the operating system hides SMP management from the user, as it handles thread scheduling and process synchronization; moreover, interrupt management can affect only one processor at a time.

2.2.2 Multithreading

Multithreading is a technique to exploit thread-level parallelism: the unit of execution becomes a single thread of the program in memory. Once again, it is necessary to enable this feature in software, through operating system support (2). It is possible to increase execution parallelism by using one of the following implementations:

interleaved multithreading (fine-grained): at every clock cycle the processor switches execution from one thread to another, skipping any thread that is not ready (blocked for data dependency or memory latency), thus avoiding process suspension and pipeline stalls.

blocked multithreading (coarse-grained): the instructions of a thread are executed continuously until an event such as a cache miss causes a delay; in that case execution is switched to another thread.

simultaneous multithreading (SMT or Hyperthreading): instructions from multiple threads are executed simultaneously, exploiting the intrinsic parallelism of the execution units of the processor.

chip multithreading: multiple processors are replicated on the same physical chip, each handling separate thread sets; in this way pipeline execution is much simplified.

The Simultaneous Multithreading technique has been implemented in most modern processors, as it has shown the best performance benefits in a variety of applications during testing.

2.2.3 GPGPU computing

General-purpose computing on graphics processing units refers to a technique that allows general purpose execution through the processors present in modern video cards (namely, GPUs). This methodology makes it possible to exploit the GPU computing power, which is usually reserved for computer graphics, for almost any kind of operation: since the graphics processing unit is composed of a large number of array processors, using a GPGPU programming language enables automatic streaming execution.

Applications that especially benefit from streaming execution are multimedia-related, such as digital signal processing (for audio/video or image manipulation), but there are also many implementations of computer clusters, physics simulators, mathematical solvers and raytracing done with GPGPU. Moreover, older array-based software also benefits from this rather new technology, for example in cryptography, DNA folding, neural networks and medical imaging.

2.2.4 NUMA machines

While general purpose processors adopt a uniform memory access (UMA), it is not uncommon to find systems whose access time is not uniform and depends on the position of the processor (NUMA, non-uniform). NUMA machines are usually physically distributed but logically shared, merging the memory of the linked SMP systems, meaning that one node can directly access the memory of another node and that not all processors have equal access time to all memories.

Memory is mapped like a global address space; this feature provides a user-friendly programming perspective to memory, as data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs. However, there is a lack of scalability between memory and CPUs, because adding more CPUs can geometrically increase traffic on the shared memory-CPU path. Moreover, a whole set of synchronization constructs needs to be implemented to ensure “correct” access to global memory, and a software layer is often needed to guarantee program access and workload distribution. One final disadvantage is that it is becoming increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

2.2.5 Clusters

A cluster is an alternative or an addition to symmetric multiprocessing for achieving high performance; a “cluster” can be defined as a group of computers interconnected through some network interface and working together as a unified computing resource. It is possible to create large clusters that can by far outperform any standalone machine, with the advantage that it is relatively easy to add new components, even in small increments.

Both clusters and SMP systems provide a configuration for high performance applications, and both bring advantages and disadvantages. For example, an SMP system is easier to manage and has fewer problems in running single-processor software, while clusters require an in-depth program revision; on the other hand, clusters dominate the final performance outcome and offer more solutions for availability.

Clusters are historically divided into:

High-availability clusters: meant to improve the availability offered by the cluster itself; they usually exploit redundancy, so that when one node fails it can be immediately substituted by a spare one (active or passive standby).

Load-balancing clusters: their primary purpose is to distribute evenly the workload of a given task or service among the rest of the cluster, with load balancing and work distribution.

Compute clusters: used for computational activity rather than services; nodes are tightly coupled and computation usually implies a considerable quantity of communication. Programs can usually be ported to this environment with little effort through simple library routines (e.g. MPI).

Grid computing: similar to compute clusters, but focused more on the final computational throughput than on workload distribution and tightly coupled jobs; computation consists of many independent jobs which do not have to share data during the computation process.

2.3 Scientific software advance

The use of parallelization technologies such as OpenMP and MPI is not new in scientific software. As a matter of fact, it is normal to find quite a number of projects that exploit those technologies. For example, it is possible to cite the Folding@Home project, from Stanford University’s chemistry department, currently the most powerful distributed computing cluster, which is being developed using an MPI layer between its nodes; there are also many entries in the TOP500 list (a project ranking and detailing the 500 most powerful known computer systems in the world), like Pleiades and Ranger, that use Infiniband as the connection link among the nodes of the cluster.

As for electromagnetic field analyzers, there has been some previous work with OpenMP: (3) and (4) describe a possible implementation for hybrid solvers, but the addressed software has different solving and modeling routines. The proposed work does not rely on the standard FEM approach, but takes on a Finite Formulation of a nonlinear magneto-static algorithm, which can be safely parallelized and distributed; see (5) for more information.

CHAPTER 3

TECHNOLOGY

3.1 Parallel applications with OpenMP

OpenMP is an application programming interface (API) that offers a set of compiler directives, library routines and environment variables to enable shared memory multiprocessing for C, C++ and FORTRAN programs. OpenMP stands for Open Multi-Processing and it is implemented in many open source and commercial compilers, like the Intel C++ and FORTRAN Compilers (icc and ifort) and the GNU Compiler Collection (gcc). Among the key factors for its popularity are the ease of handling threads and shared variables and the simplicity of porting programs to a multiprogramming scheme with very little code change; moreover, OpenMP enables parallel execution control for languages that cannot usually handle multithreading and synchronization primitives, like, for instance, FORTRAN.

With this technology the main program forks a set number of parallel threads which carry out a task, dividing the workload among different cores; by default every thread executes its section of code independently. In this way it is possible to divide the sequence of program execution in a tree-like structure (as shown in Figure 3). After execution of the parallel job, the threads are joined back into the main (or master) thread, resuming normal sequential execution.

Figure 3. Image showing the tree splitting procedure of a sequential task

OpenMP exploits preprocessor directives for thread creation and synchronization, workload distribution and sharing, and data and function management, while retaining compatibility with unsupported compilers. One directive is particularly suited for loop parallelization, as it offers fine-grained control over the scheduling of the threads and over the distribution of the loop iterations among the thread pool. Other directives directly manage thread interaction and synchronization objects (critical regions and variable locking). In order to prevent data corruption due to overlapping threads, all variables of the parallel section must have a declared visibility scope, either shared or private.
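The fork-join structure, the shared/private distinction and the critical region described above can be imitated with ordinary threads. The following is a minimal sketch under stated assumptions: it uses Python's standard `threading` module rather than OpenMP directives, the function name `parallel_sum_of_squares` is hypothetical, and in CPython the global interpreter lock prevents any real speedup, so only the structure is meaningful.

```python
import threading


def parallel_sum_of_squares(n, num_threads=4):
    """Fork `num_threads` threads, let each sum its iterations, then join."""
    shared_total = 0               # shared: one copy visible to every thread
    lock = threading.Lock()        # protects the critical region below

    def worker(thread_id):
        nonlocal shared_total
        private_sum = 0            # private: each thread has its own copy
        # Iterations are dealt out cyclically: thread t takes t, t + T, ...
        for i in range(thread_id, n, num_threads):
            private_sum += i * i
        with lock:                 # critical region: one thread at a time
            shared_total += private_sum

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:              # fork
        t.start()
    for t in threads:              # join back into the master thread
        t.join()
    return shared_total


if __name__ == "__main__":
    print(parallel_sum_of_squares(100))
```

Accumulating into a private variable and entering the critical region only once per thread keeps the synchronization cost low; updating the shared total inside the loop would serialize every iteration.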

However, it is important to clarify that using OpenMP on an N processor machine does not reduce the execution time by a factor of N. As a matter of fact, there are several reasons for this:

• Symmetric Multi Processor computers have increased computational power, but the memory bandwidth does not scale proportionally to the number of processors (or cores); performance degradation occurs especially when the shared memory bandwidth is filled up and data transfer is slowed down.

• not every portion of the code can actually be parallelized;

• the theoretical limit imposed by Amdahl’s Law, which regulates the maximum theoretical speedup of parallel applications, holds;

• synchronization overhead, critical region management, context switch costs and load balancing among the threads may reduce the final speedup.

3.1.1 Amdahl’s Law

Amdahl’s Law is a method used for finding the maximum speed improvement in parallel computing environments. The speedup highly depends on the size of the parallelizable code (6). The formula states that the potential speedup of the program directly depends on the fraction of code P that can be parallelized:

$$\text{speedup} = \frac{1}{1 - P} \qquad (3.1)$$

Basically, if none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup); if all of the code is parallelized, P = 1 and the speedup is infinite (in theory); if 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast, and so on. The next figure (Figure 4) shows the theoretical speedup curve with infinite processors.

Figure 4. Graph plotting of the theoretical curve from Amdahl’s Law

When a finite number of processors is considered and the code has parts that cannot be parallelized, the relationship can be updated to

$$\text{speedup} = \frac{1}{\frac{P}{N} + S} \qquad (3.2)$$

where N is the number of processors, P the portion of parallelizable code and S the portion of serial code (corresponding to (1 − P)). The following figure (Figure 5) shows a set of examples with different fractions of parallelizable code over a variable number of processors. It is possible to see not only that a 95% parallelizable program has a maximum speed improvement in the order of 20x, however high the number of available processors, but also that a highly sequential program gains almost no acceleration.

Figure 5. Graph plotting of Amdahl’s Law for multiprocessors
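To make the two formulas concrete, the sketch below evaluates equation (3.2) for a few processor counts and equation (3.1) as its limit for an unbounded number of processors. The function names are assumptions made for this example.

```python
def amdahl_speedup(p, n):
    """Equation (3.2): speedup with parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    s = 1.0 - p                    # serial fraction S
    return 1.0 / (p / n + s)


def amdahl_limit(p):
    """Equation (3.1): the speedup approached as n grows without bound."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p == 1.0:
        return float("inf")        # fully parallel code: no finite bound
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.95):
        row = [round(amdahl_speedup(p, n), 2) for n in (2, 8, 64, 1024)]
        print(p, row, "limit:", round(amdahl_limit(p), 2))
```

For P = 0.95 the limit evaluates to 1 / 0.05 = 20, which is the 20x ceiling visible in Figure 5; with P = 0.5 the limit is 2 regardless of how many processors are added.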

3.1.2 Benchmarking

In order to understand the possible benefit from using OpenMP, some tests have been run to find the best possible configuration of the number of threads and the chunk size. A simple test program was used, with a complex and long loop containing some processor intensive operations (mainly mathematical operations like power and square root). The particular case of an “interesting” loop has been chosen because it showed with enough simplicity the effort/benefit ratio of OpenMP.

The two main configuration variables that characterized the benchmarks were the scheduler type and the chunk size, plus the total number of threads involved in the program. The chunk size is a positive integer representing the number of iterations handed to a thread at a time, while the scheduler type may be:

STATIC: loop iterations are divided into chunks of chunk-size iterations, which are assigned to the threads in a fixed order.

DYNAMIC: loop iterations are divided into chunks of chunk-size iterations, and each chunk is assigned to a thread dynamically, as soon as that thread has completed its previous one.

GUIDED: chunks are assigned dynamically as well, but their size starts large and shrinks in proportion to the number of iterations still unassigned, down to the given chunk size.

Other scheduler types are auto, in which the choice of the schedule is delegated to the compiler or the runtime system, and runtime, in which the schedule is selected at run time from the environment (the OMP_SCHEDULE variable). As can be expected, guided-scheduled threads work best with very small chunk sizes (with respect to the total number of iterations), while static and dynamic scheduling prefer a medium chunk size value.

3.1.2.1 Sequential program with OpenMP enhancements

In this first test the program is sped up with an increasing number of available threads, going beyond the eight physical cores actually present. All three scheduling algorithms are evaluated. The test program partially emulates some computationally intensive routines of the target software: the main loop is composed of several mathematical functions that are known to stress the processor and require a long CPU time to be carried out. The code is reported in appendix B.

It is possible to see that there is a huge impact when a second thread is added (50% time reduction), after which the execution time asymptotically tends to a given value, fully respecting Amdahl’s Law. It is interesting to notice that the three schedulers perform in the same range of values and that the best performance is achieved in the region of 8-9 threads (given the eight-core machines used). Beyond this value all the schedulers, static and dynamic in particular, suffer from excessive context switches and interference from the operating system preemption mechanism. The value of the first column (one thread) may be safely considered as the reference for the program without OpenMP optimizations.

Beware that setting a fixed number of threads may reduce the total performance of the application, as the scheduling algorithm is more efficient when it can control the pool of threads as a whole. As a matter of fact, the thread number in the main program has been left to the default value for this very reason.

Figure 6. Performance overview of an OpenMP threaded program

3.1.2.2 OpenMP schedulers performance

Having evaluated the performance with different numbers of threads, now the three types of available schedulers are compared; moreover for each scheduler a different order of chunk size is tested. It is interesting to notice that for very high chunk sizes OpenMP cannot reduce the execution time, and this holds for every type of scheduler: the reason for this behavior resides in how OpenMP manages iterations, since all iterations of the loop are assigned to a single thread and therefore there is no benefit.

3.1.2.2.1 Static Scheduler

The static scheduler works as expected (Figure 7), showing a very good performance increase in the region of 7-8 threads with 10-100 as chunk value.

Figure 7. OpenMP static scheduler performance chart

3.1.2.2.2 Dynamic Scheduler

Because of its dynamic behavior, the dynamic scheduler shows very peculiar results with different configurations, as shown in Figure 8. For example, there are configurations with large chunks and a small number of threads that even present an additional overhead, or small chunks that cannot leave the average value regardless of the thread number. Even with this disparity, however, the best execution time reduction is achieved in the 7-9 threads region by chunks of medium order.

Figure 8. OpenMP dynamic scheduler performance chart

3.1.2.2.3 Guided Scheduler

The final scheduler presented here is the most straightforward and the best performing, thanks to the more advanced algorithm of the guided scheduling. As a matter of fact for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations), see source (7). As anticipated, this algorithm works best with very small chunks, as it can apply its algorithm without interferences, and always in the 8-9 threads region.

Figure 9. OpenMP guided scheduler performance chart
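The guided rule quoted above can be made concrete with a small simulation. This is only a sketch under a stated assumption: the proportionality constant is taken as exactly 1, so each chunk is the ceiling of (unassigned iterations / threads), whereas real OpenMP runtimes are free to choose a different constant. The name `guided_chunks` is hypothetical.

```python
import math
from typing import List


def guided_chunks(iterations: int, threads: int, min_chunk: int = 1) -> List[int]:
    """Return the sequence of chunk sizes handed out by a guided-like scheduler.

    Assumption: chunk = ceil(remaining / threads), never below min_chunk,
    except for the final chunk, which takes whatever is left.
    """
    if iterations < 0 or threads < 1 or min_chunk < 1:
        raise ValueError("iterations >= 0, threads >= 1 and min_chunk >= 1 required")
    remaining = iterations
    chunks: List[int] = []
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / threads))
        size = min(size, remaining)  # the last chunk may hold fewer than min_chunk
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    print(guided_chunks(100, 4, 1))   # chunk sizes decrease towards 1
    print(guided_chunks(100, 4, 10))  # chunk sizes never drop below 10, except the last
```

The decreasing sizes illustrate why a small chunk value suits this scheduler: large chunks are distributed first, and the small ones at the end balance the load among the threads.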

OpenMP enhancement results

This last section summarizes the global results from the point of view of the scheduler. As reference value, the maximum execution time reduction has been selected for each chunk size of each scheduling algorithm; all these results come from the 7-9 threads region.

The test run shows that the scheduler that performed best is the guided scheduler with

chunk size in the order of the units, and for this reason it has been chosen as default scheduler

in all OpenMP directives inserted.

Figure 10. OpenMP scheduler overview


3.2 Infiniband

Infiniband is the union of two competing transport designs, Next Generation I/O from Intel, Microsoft and Sun, and Future I/O from Compaq, IBM and Hewlett-Packard. It has become the de facto standard for high speed cluster interconnection, outperforming Ethernet in both transfer rate and latency.

This technology implements a modern interconnection link using a point-to-point bidirectional serial transfer, supporting several signaling rates. It is used in high-performance computing both for high-speed connection between processors and peripherals and for low-latency networking.

The standard transmission rate is 2.5 Gbit/s per link, while double and quad data rates achieve 5 Gbit/s and 10 Gbit/s respectively. Moreover it is possible to join links in groups of 4 or 12, enabling even higher transfer speeds (up to 120 Gbit/s). However it is important to state that transmitted data is protected through information redundancy: every 10 bits sent carry only 8 bits of useful information, reducing the useful data transmission rate. Table I summarizes the effective data rates of the various configurations.

Most notably, there is no standard programming interface for the device: only a set of functions (referred to as verbs) must be present, leaving the implementation to the vendors. The most commonly accepted implementation is provided by the OpenFabrics Alliance. Since Infiniband is a transport layer, many protocols can run on it, from TCP/IP to OpenIB (described in section 3.3.1).




useful data    Single Data Rate    Double Data Rate    Quad Data Rate
1X             2 Gbit/s            4 Gbit/s            8 Gbit/s
4X             8 Gbit/s            16 Gbit/s           32 Gbit/s
12X            24 Gbit/s           48 Gbit/s           96 Gbit/s

raw data       Single Data Rate    Double Data Rate    Quad Data Rate
1X             2.5 Gbit/s          5 Gbit/s            10 Gbit/s
4X             10 Gbit/s           20 Gbit/s           40 Gbit/s
12X            30 Gbit/s           60 Gbit/s           120 Gbit/s
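The two halves of the table are related by the 8-useful-bits-out-of-10 rule stated in the text. A minimal sketch that regenerates both halves follows; the names `RAW_GBITS_PER_LANE`, `raw_rate` and `useful_rate` are illustrative only.

```python
# Raw signaling rate of a single (1X) link in Gbit/s, as given in the text.
RAW_GBITS_PER_LANE = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}
LINK_WIDTHS = (1, 4, 12)


def raw_rate(width: int, speed: str) -> float:
    """Raw rate in Gbit/s of a link made of `width` aggregated lanes."""
    return width * RAW_GBITS_PER_LANE[speed]


def useful_rate(width: int, speed: str) -> float:
    """Useful rate in Gbit/s: only 8 bits out of every 10 sent carry data."""
    return raw_rate(width, speed) * 8 / 10


if __name__ == "__main__":
    for width in LINK_WIDTHS:
        cells = ", ".join(
            f"{speed}: raw {raw_rate(width, speed):g} / useful {useful_rate(width, speed):g} Gbit/s"
            for speed in ("SDR", "DDR", "QDR")
        )
        print(f"{width}X -> {cells}")
```

For instance a 4X double data rate link carries 20 Gbit/s on the wire, of which 16 Gbit/s are available to the application.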

3.3 Distributed execution with MPI

MPI is a high level language-independent API used both for parallel computing and for one-to-one, one-to-many and many-to-many inter process communication (IPC). It has become the de facto standard for process communication despite the lack of sponsorship by any association. Originally it was developed by William Gropp and Ewing Lusk among others.

This set of APIs is used in high-performance computing for its scalability, portability and performance, as it implements message passing between distributed memories with very few directives. It usually resides on level 5 of the OSI model but, as there is no strict constraint on this point, there are many implementations that offer different transport, network and data link layers.

MPI is available for many programming languages including C, C++, FORTRAN and Java; sometimes implementations benefit from the host language, for example using object-oriented programming in C++ and Java, and from the hardware they run on. Among the most widespread libraries it is possible to find OpenMPI, MPICH2 and MVAPICH2, which differ only in threading support, network availability (e.g. Ethernet or Infiniband) and hardware


3.3.1 MPI over Infiniband

One of the most widely used environments for MPI is Infiniband; as a matter of fact, thanks to the low latency of Infiniband, a small packet sent through a connection link does not suffer the major overhead it would incur on Ethernet, for example. In order to set up a distributed system of this kind, additional software is needed for managing the Infiniband subnet (OpenSM) and for handling the transport layer (OpenIB).

The modularity of MPI and Infiniband allows different configurations, and it is common to transmit packets with either Infiniband or a TCP/IP stack. This is possible because the transport layer of MPI is handled by two components (among others): the Point-to-Point Messaging Layer (PML) and the Byte Transfer Layer (BTL). The PML abstracts the communication mechanism with buffers, synchronization points and acknowledgement messages; the BTL on the other hand translates the byte messages into the network layer byte sequence. OpenIB is a BTL protocol for sending messages on Infiniband.

Subsequently the functions (or verbs) available in the Infiniband drivers are invoked and control is moved from user space to kernel space, where the message is finally sent across the network link.

This seemingly complex structure reduces code complexity and increases interoperability and maintainability between different implementations.

3.3.2 Benchmarks

As it has been done with OpenMP, some tests were also performed on the MPI installation and on the Infiniband structure to check that the machine configuration was correct and that devices were running at full speed. Two different MPI implementations are compared. Other types of MPI implementation exist, most notably MPICH and Lam-MPI, from which both MVAPICH and OpenMPI derived, but they lack support for Infiniband: any packet transmitted would revert to plain TCP/IP.

3.3.2.1 Single message over Infiniband with MPI

In this test the transfer time of messages over Infiniband with MPI directives is evaluated. The program makes heavy use of the MPI_Send and MPI_Recv directives and utilizes a timing function with resolution of milliseconds; message size increases quadratically and time is measured with millisecond precision. It has been noticed that a warm-up phase (exchanging some messages between the nodes) is necessary before any measurement is done, because the whole structure of MPI plus Infiniband must be activated.

Data is displayed in a semi-logarithmic scale so that the whole slope can be shown, and it is possible to notice that OpenMPI outperforms MVAPICH in small and large quantities of data, but it is slower in medium-sized messages. With MVAPICH it is not possible to send data over 2 GB, due to implementation limits. OpenMPI does not suffer from this behavior, but on the other hand it has a sort of latency of 3.5 seconds before programs start executing (and this is not recorded in this test).

Figure 11. Time v. size for a single message

3.3.2.2 Multiple messages over Infiniband with MPI

Using the same structure as above, here the time v. size relation is tested with multiple messages (1024 messages exchanged for each tested size). The results are similar to the previous case.

Figure 12. Time v. size for 1024 consecutive messages

3.3.2.3 Latency

One final test has been run to determine the expected latency in message passing; this has been achieved by sending a 0-length packet using some data types available in MPI. However, due to the modularity of the MPI over Infiniband structure, the MPI initialization overhead must be removed: for this reason the same test is repeated both on a single machine and on the two machines. The latency value measured with this method is 8 µs, which is compatible with the Infiniband board specifications. The complete table of results follows.

TABLE II
MPI OVER INFINIBAND 0-BYTE MESSAGE LATENCY

Test type      µ-seconds
Single node    26
Two nodes      34
Latency        8

CHAPTER 4

ALGORITHM

4.1 Overview

The target application is a suite of programs called Sally3d, and it has been ported from a VMS system to standard FORTRAN, with a standard makefile instead of terminal scripts, so that it can be compiled on any UNIX based operating system. The software is designed for electromagnetic field analysis and micromagnetic modeling of nanomagnets; for this purpose, magnetization dynamics in nanomagnets is described by the Landau-Lifshitz-Gilbert (LLG) equation which rules the gyromagnetic precession of the magnetization vector field around the so-called micromagnetic effective field. The effective field takes phenomenologically into account the interactions occurring in magnetic materials such as short-range (exchange, anisotropy) and long-range interactions (magnetostatics, Zeeman). Magnetization dynamics in a ferromagnetic body is described by the following Landau-Lifshitz-Gilbert (LLG) equation:

$$\frac{\partial \mathbf{m}}{\partial t} = -\mathbf{m} \times \left( \mathbf{h}_{\mathrm{eff}}[\mathbf{m}] - \alpha \frac{\partial \mathbf{m}}{\partial t} \right) \qquad (4.1)$$

where $\mathbf{m} = \mathbf{m}(\mathbf{r}, t)$ is the magnetization vector field normalized to the saturation magnetization $M_s$, time is measured in units of $(\gamma M_s)^{-1}$ ($\gamma$ is the absolute value of the gyromagnetic ratio), $\alpha$ is the dimensionless damping parameter, and $\mathbf{h}_{\mathrm{eff}}[\mathbf{m}(\mathbf{r}, t)]$ is the effective field operator, which can be obtained by the variational derivative of the free energy functional:

$$\mathbf{h}_{\mathrm{eff}}[\mathbf{m}] = -\frac{\delta g_L[\mathbf{m}]}{\delta \mathbf{m}} \qquad (4.2)$$

where

$$g_L[\mathbf{m}] = \frac{1}{V_\Omega} \int_\Omega \left[ \frac{l_{ex}^2}{2} |\nabla \mathbf{m}|^2 - \frac{1}{2} \mathbf{h}_m \cdot \mathbf{m} + \varphi(\mathbf{m}) - \mathbf{h}_a \cdot \mathbf{m} \right] dV \qquad (4.3)$$

$\varphi(\mathbf{m})$ is the anisotropy energy density and $l_{ex} = \sqrt{(2A)/(\mu_0 M_s^2)}$ is the exchange length ($A$ is the exchange constant and $\mu_0$ the vacuum permeability), $\mathbf{h}_m$ and $\mathbf{h}_a$ are the demagnetizing and applied fields, respectively, and $V_\Omega$ is the body volume. In addition, the homogeneous Neumann boundary condition $\partial \mathbf{m}/\partial n = 0$ is imposed at the body surface.

In order to obtain a spatially discretized version of Equation 4.1, a partition of the region $\Omega$ in $N$ cells $\Omega_k$ with volume $V_k$ is considered, and it is assumed that the cells are small enough that the vector fields $\mathbf{m}(\mathbf{r}, t)$ and $\mathbf{h}_{\mathrm{eff}}[\mathbf{m}(\mathbf{r}, t)]$ can be considered spatially uniform within each cell. Symbols $\mathbf{m}_k(t)$ and $\mathbf{h}_{\mathrm{eff},k}$ denote the vectors associated with the generic k-th cell. Beside the cell vectors, the mesh vectors $\underline{\mathbf{m}} = (\mathbf{m}_1, \ldots, \mathbf{m}_N)^T \in \mathbb{R}^{3N}$ containing the whole collection of cell vectors are defined. Now it is possible to write down the discretized LLG equation in the following form, which consists of a system of ordinary differential equations:

$$\frac{d\mathbf{m}_k}{dt} = -\mathbf{m}_k \times \mathbf{h}_{\mathrm{eff},k}[\underline{\mathbf{m}}] + \alpha\, \mathbf{m}_k \times \frac{d\mathbf{m}_k}{dt} \qquad (4.4)$$

where $\mathbf{m}_k$ is the average magnetization of the k-th cell. It is worth noting that the effective field in the k-th cell depends on the magnetization of the whole cell collection due to the magnetostatic interaction, namely $\mathbf{h}_{\mathrm{eff},k} = \mathbf{h}_{\mathrm{eff},k}[\underline{\mathbf{m}}]$. The numerical solution of Equation 4.4 will provide the time evolution of magnetization.

4.2 Code Flowchart

The kernel of the micromagnetic solver integrates over time the LLG equation discretized with respect to space; this operation is performed by the GILBERT routine and it is reported in Figure 13. The equation is a non linear differential equation, whose solution is obtained through the Newton-Raphson method of approximation. At every time step, the next value of the magnetic vector is computed by collecting the different finite elements of the magnetic field; this is performed by the GINT function.

The section of code which has been parallelized and distributed (outlined with yellow in the next figure) implements the magnetostatic and anisotropic field solvers; also the part that combines together the different field elements has been updated with OpenMP and MPI directives. This development scheme has been chosen on the grounds that the real computational bottleneck resulted particularly in the magnetostatic solver and partially in the anisotropic solver.

Figure 13. Flowchart of the main functions implemented in the code

4.3 Test Case

In order to carefully analyze the performance of the program and to identify the possible parallelization points, as well as to obtain useful data, a particular test was prepared. The test case is the fourth standard problem of micromagnetics, proposed by Bob McMichael, Roger Koch and Thomas Schrefl. Quoting (8), the problem focuses on dynamic aspects of micromagnetic computations. The initial state is an equilibrium s-state (Figure 15) which is obtained after applying and slowly reducing a saturating field along the [1,1,1] direction to zero. Fields of magnitude sufficient to reverse the magnetization of the rectangle are applied to this initial state and the time evolution of the magnetization is examined as the system moves towards equilibrium in the new fields. The problem will be run for two different applied fields. At t = 0 one field will be applied to the equilibrium s-state: the field is composed of µHx = −24.6 mT, µHy = 4.3 mT, µHz = 0.0 mT (corresponding to approximately 25 mT, directed 170 degrees counterclockwise from the positive x axis).

Figure 14. Standard problem #4 representation
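The statement that the components −24.6 mT and 4.3 mT correspond to roughly 25 mT at 170 degrees can be verified with a few lines of arithmetic. The following is a small sketch; the helper name `field_magnitude_and_angle` is illustrative.

```python
import math
from typing import Tuple


def field_magnitude_and_angle(bx_mT: float, by_mT: float) -> Tuple[float, float]:
    """Return (magnitude in mT, angle in degrees counterclockwise from +x)."""
    magnitude = math.hypot(bx_mT, by_mT)
    angle = math.degrees(math.atan2(by_mT, bx_mT)) % 360.0
    return magnitude, angle


if __name__ == "__main__":
    magnitude, angle = field_magnitude_and_angle(-24.6, 4.3)
    print(f"|H| = {magnitude:.2f} mT at {angle:.1f} degrees")
```

The in-plane components give a magnitude just under 25 mT and an angle just over 170 degrees, in agreement with the rounded values quoted in the problem statement.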

The problem was chosen so that resolving the dynamics should be easier for the 170 degree applied field than for the 190 degree applied field. Preliminary simulations reveal that, in the case of the field applied at 170 degrees, the magnetization in the center of the rectangle rotates in the same direction as at the ends during reversal. In the 190 degree case, however, the center rotates the opposite direction as the ends, resulting in a more complicated reversal. The field amplitudes were chosen to be about 1.5 times the coercivity in each case.

Figure 15. S state field representation

4.4 Profiling

Thanks to the standardization of the program code, it was possible to exploit the gprof utility, available in the gcc suite. This utility provides procedure level timing information with reasonable resolution, as well as a complete call graph view for identifying the most computationally expensive functions; the call graph has been reported in Figure 16. According to the profiler, the following functions were the most time consuming:

• calc_intmudua
• curledge and its caller calc_hdmg_tet
• calc_mudua
• campo_effettivo

Most of the software is composed of very small routines that are called with very high frequency, thus very difficult to optimize and to measure (in fact they are not even reported in profiler reports); only the noted functions have an observable impact on the overall execution time.

4.5 Compiler optimizations

Once again, due to the porting operation that has been performed, several compiler optimizations became available and were subsequently added in order to increase the throughput of the program. Most of the additions have been chosen following official gcc documentation and manual pages.

4.5.1 Native switch

The key for optimization relies on the native machine capabilities, setting the right processor architecture and the available SSE flags; in order to activate at once all the features of a given architecture and of a given processor it is required to set -march=native. In this way all processor specific instructions can be accessed and all floating point capabilities fully exploited. Moreover the floating point instructions are specifically set to use any SSE extension (-mfpmath=sse enabled by default).

Figure 16. Call graph scheme of the target software

A similar optimization is achieved also in the Intel FORTRAN Compiler with the -axS -xS switches.

4.5.2 Loop unrolling

Among the loop transformation techniques, loop unrolling has achieved wide success in compiler theory. Its goal is to increase the execution speed of the program at the expense of size. Loop unrolling is performed by reducing (if not eliminating) the number of "end of loop" tests; in this way the number of jumps and of conditional branches decreases, and thanks to the larger size the number of cache hits increases (in big caches). In the build this optimization is associated with the -O3 flag; in GCC, general loop unrolling is controlled by -funroll-loops, which -O3 does not enable on its own.

4.5.3 IEEE compliance

Due to the highly mathematical nature of the software, the -ffast-math flag has been added: this flag activates a set of optimizations that allow some general speedups by discarding some return codes and by skipping some redundant operations (like ignoring the sign of zero or not considering NaN and +-Inf number types). The main drawback of this optimization is that it is not possible to guarantee compliance with the IEEE, ISO and/or ANSI rules that specify arithmetic compatibility, exceptions and operand order in floating point operations.

4.5.4 Library Stripping

One final type of optimization has been inserted at linking time. The following options try to decrease load time for library functions, modifying the executable header (ELF in this context) and symbol handling (9). These options must be passed with the -Wl flag so that the compiler can forward them to the linker. More specifically the -O1 switch performs in this way: as symbols get inserted in the ELF header, they are stored in hash tables; the default configuration is to keep the hash keys small, performing string comparison when collisions occur. This optimization shifts the balance towards short hash chains, increasing hash key length and header size, but actually reducing symbol look-ups.

CHAPTER 5

IMPLEMENTATION

5.1 General Scheme

Analyzing the functions of 4.4 over several profiling sessions, a common pattern has been found. As a matter of fact, every function contained one or more loops, carrying quite a number of instructions over arrays and matrices. For this reason a general plan has been decided and summed up in Figure 17.

As first step, the standard sequential loop is parallelized to fully exploit all the eight cores each single machine can offer. By setting up proper shared/private variable lists, the loop is divided among a given number of OpenMP threads and each carries out a portion of the iterations; as soon as a thread ends, a new one is created and assigned an element, until the whole loop section is completed.

The second step in this strategy is to split the loop in two distinct and equal parts before exploiting OpenMP. Each part is submitted to a node of the cluster and separately executed; at the end of the loop data is exchanged back with MPI and merged so that the two machines can continue working on complete arrays. Thanks to Infiniband, latency for exchanged data sets is reduced to a minimum.

Figure 17. Implementation scheme overview

Even though OpenMP requires little software modification, some updates have been carried out in order to obtain the maximum possible throughput from the software, mainly reducing portions of redundant code. It should be noted, however, that the software is not embarrassingly parallel; as a matter of fact there were a number of modifications to the software in order to apply parallelization and distributed computing. The synchronization object mostly used is the implicit blocking offered by the send() and recv() mechanism: since data is exchanged between the two machines in the same manner, until either of them is ready to process data, the other cannot continue. In other sections of the code, synchronization was achieved by native OpenMP directives, as shown in 5.3.4.

5.2 Hardware Support

The hardware selected for implementing the cluster consists of two computers, each supplied with:

• two quad core Intel Xeon E5420 running at 2.5 GHz frequency, with 6 MB of L2 cache,
• 32 GB of ECC DDR2 RAM,
• an Intel Server Board S5000PSLSATAR motherboard,
• one Infiniband card from Mellanox, model ConnectX IB MHGH28-XTC DDR HCA PCI-e 2.0 x8 Memory Free.

The focus in building these computers has been to search for low-cost components that could enable high performance results. The two machines are connected together with an end-to-end Infiniband link, running at full speed as the cards are mounted on the PCI Express x16 v1.1 slot.

5.3 Applied Directives

In this section some example code has been extracted from the source of the program and explained.

5.3.1 MPI Layer

The following sections of code show some sample "header" and "epilogue" MPI functions that enable splitting the array and merging it back, allowing both machines to operate on separated data subsets. Some preprocessor directives have been inserted in order to maintain compatibility on non MPI systems.

The header part analyzes the rank variable, which differs for every node of the MPI cluster: inside the if clause the array range is defined by setting the start_INDEX and end_INDEX variables (which intuitively represent the range beginning and ending). So the first node works on the first half of the array and the second node on the second half.

#ifdef MPI_ENABLED
      if (rank .eq. 0) then
        start_INDEX = 1
        end_INDEX = NEDGE/2
      else if (rank .eq. 1) then
        start_INDEX = ( NEDGE/2 ) + 1
        end_INDEX = NEDGE
      endif
#else
      start_INDEX = 1
      end_INDEX = NEDGE
#endif
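The index arithmetic of the header, together with the element counts that the two nodes exchange afterwards (NEDGE/2 and NEDGE - NEDGE/2), can be checked outside of FORTRAN. The sketch below mirrors the integer division of the original code; `split_range` and `exchange_counts` are hypothetical helper names, not routines of the program.

```python
from typing import Tuple


def split_range(nedge: int, rank: int) -> Tuple[int, int]:
    """1-based inclusive (start, end) indices handled by node `rank` (0 or 1)."""
    half = nedge // 2  # same truncation as the FORTRAN integer division NEDGE/2
    if rank == 0:
        return 1, half
    if rank == 1:
        return half + 1, nedge
    raise ValueError("this sketch only covers the two-node case")


def exchange_counts(nedge: int) -> Tuple[int, int]:
    """Number of elements owned by node 0 and by node 1 respectively."""
    half = nedge // 2
    return half, nedge - half


if __name__ == "__main__":
    for nedge in (8, 9):
        print(nedge, split_range(nedge, 0), split_range(nedge, 1), exchange_counts(nedge))
```

When NEDGE is odd the second node receives one element more than the first, which is why the size of the second half is computed as NEDGE - NEDGE/2 rather than NEDGE/2.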

      DO M=start_INDEX,end_INDEX
      [...]

So after the loop has terminated, the array on which the iteration worked must be synchronized on both nodes; this is done with a couple of MPI_SEND and MPI_RECV instructions. The rank variable is checked again to be able to tell which portions of the array must be updated.

#ifdef MPI_ENABLED
      tag = 1
      ISIZE = NEDGE - NEDGE/2
      if (rank .eq. 0) then
        dest = 1
        source = 1
        call MPI_RECV(BINTMU( (NEDGE/2) + 1), ISIZE, MPI_REAL8,
     &       source, tag, MPI_COMM_WORLD, stat, err)
        call MPI_SEND(BINTMU, NEDGE/2, MPI_REAL8, dest, tag,
     &       MPI_COMM_WORLD, err)
      else if (rank .eq. 1) then
        dest = 0
        source = 0
        call MPI_SEND(BINTMU( (NEDGE/2) + 1), ISIZE, MPI_REAL8,
     &       dest, tag, MPI_COMM_WORLD, err)
        call MPI_RECV(BINTMU, NEDGE/2, MPI_REAL8, source, tag,
     &       MPI_COMM_WORLD, stat, err)
      endif
#endif

5.3.2 DO directive

The DO directive is the most common in this configuration. It requires a list of shared and private variables: for the latter case, a new memory position is allocated for each thread. Workload is distributed according to the selected scheduler as described in 3.1.2.

!$OMP PARALLEL SHARED(IFAEXT,NPOS,AMAG,BINTMU,TM)
!$OMP& PRIVATE(I,IMAG,KH,KK,KCOMP)
!$OMP DO SCHEDULE(GUIDED)
      DO I=start_INDEX,end_INDEX
      [...]
      BINTMU(I)=BINTMU(I)-
     +  IFAEXT(I,KH,2)*TM(NPOS+(IMAG-1)*3+KCOMP)*AMAG(IMAG,KCOMP)
      [...]
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

5.3.3 REDUCTION directive

One of the possible benefits in parallelization is to use a mathematical property of addition and subtraction: since varying the order does not change the result, the reduction directive allows loop instances to be executed out of order and the final value to be computed at the end of the iteration. Without this directive the target variable could have suffered from various synchronization problems, as reading and writing to a shared position does not guarantee a correct result.

!$OMP PARALLEL PRIVATE (M,K,DOT)
!$OMP& SHARED(H_DEMG,AMAG,VOLTET,NPNMAG)
!$OMP& REDUCTION(+:VOLUME)
!$OMP& REDUCTION(-:DEMG_ENE)
!$OMP DO SCHEDULE(GUIDED)
      DO M=1,NPNMAG
        DOT=0.D0
        DO K=1,3
          DOT=DOT+H_DEMG(M,K)*AMAG(M,K)
        ENDDO
        VOLUME=VOLUME+VOLTET(M)
        DEMG_ENE=DEMG_ENE-VOLTET(M)*DOT/2.D0
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

Unfortunately this option is available for non-array operators only, so it has been applied few times.

5.3.4 Avoiding data dependency

One of the main problems of OpenMP and parallel programming in general is data dependency, and this is usually resolved by modifying the algorithm structure or by means of synchronization objects. In order to avoid inserting a critical region (corresponding to a CRITICAL or ATOMIC OpenMP directive) for shared constructs, which could have negatively affected performance, an array with self data references has been converted into a matrix and indexed with the working thread number; in this way every array element of the matrix was automatically decoupled from itself, as there could only be one single thread working on a given line at the same time.

!$OMP PARALLEL DEFAULT(PRIVATE)
!$OMP& SHARED [...]
#ifdef _OPENMP
      INUM_TH = omp_get_num_threads()
#endif
      [...]
      DO L=1,6
        LATO=(MCNT_E(L,ITET))
        AUS=SIGN(1,LATO)*
     &  ( FMDUA(MATFE(L,1)) - FMDUA(MATFE(L,2)) )
        LATO=ABS(LATO)
#ifdef _OPENMP
        INUM = omp_get_thread_num()+1
#else
        INUM = 1
#endif
        AMUDUAW(INUM,LATO) = AMUDUAW(INUM,LATO) + AUS
      ENDDO
      [...]
!$OMP END DO
!$OMP END PARALLEL

At the end of the operation, the original array is rebuilt with a simple loop on the number of generated threads (known in INUM_TH).

      DO ILATO=1, NEDGE
#ifdef _OPENMP
        DO III=1, INUM_TH
#else
        III=1
#endif
          AMUDUA(ILATO) = AMUDUA(ILATO) + AMUDUAW(III,ILATO)
#ifdef _OPENMP
        ENDDO
#endif
      ENDDO

5.4 Results

5.4.1 Reduced test case

During development the test case was run to understand if the current implementation was providing good results. The simulation had a duration of 8 ps only and was composed of just 1000 elements (see Figure 14). Further work has been done after these results were produced, but it was already possible to notice some good improvements to the software.

The following table (Table III) summarizes the total execution time in seconds; in the table the label OMP stands for OpenMP, MPI for OpenMPI over Infiniband and OPT for optimizations, while for each field a * stands for enabled and a - for disabled. It is possible to notice that the software has received a speed boost of 87.5% from the old configuration to the newer optimized MPI over Infiniband plus OpenMP environment. Not surprisingly the most effective contribution to the software is the optimizations section: this is because the ability to access all the SSE extensions with the loop unrolling configuration (see 4.5) already adds some SIMD execution to the software.

TABLE III
PARTIAL RESULTS

OMP    MPI    OPT    seconds
*      *      *      133
*      *      -      400
*      -      *      186
*      -      -      487
-      *      *      200
-      *      -      792
-      -      *      246
-      -      -      1062

However it is important to take into consideration what the targets of this project were. The sections that have been parallelized and distributed have received a speed boost, but the final software performance suffers from the presence of serial code and from the high number of small functions. It is true that the most cumbersome code for the processor has been duly parallelized, but the software is composed of a high number of other functions that are either closely serialized or of very short duration. This explains also why the optimizations bring such an improvement, as they affect all the software without distinction. So a more sensible comparison can only be done if the optimization element is kept constant, so development continued focusing on the new ratio.

5.4.2 Final test case

With the analysis of the previous data, it was possible to understand what really needed to be measured and to be improved. In the end, when all the most computationally expensive functions were addressed, it was possible to launch the final test case with the same characteristics as before and to obtain the following results:

TABLE IV
FINAL RESULTS

OMP    MPI    seconds
*      *      59
*      -      129
-      *      174
-      -      249

From the above table it is possible to understand the actual impact of the technologies used to increase the throughput of the software. The total speed improvement of the OpenMP and MPI elements alone corresponds to a raw 76% reduction of the execution time. This is a very good result, because not only is it comparable to the speedup introduced by the optimizations, but it also outdoes the results obtained from the Intel FORTRAN compiler v10 (obtained through other tests) by a rough 23%.

By looking at the contribution of the single functions in more detail, it is possible to see the effect of OpenMP and MPI over Infiniband with no overhead from the other routines.
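The two headline percentages quoted in this section follow directly from the first and last rows of the result tables (1062 s against 133 s, and 249 s against 59 s). The sketch below reproduces them; `time_reduction` and `speedup` are illustrative names.

```python
def time_reduction(baseline_s: float, improved_s: float) -> float:
    """Fraction of the baseline execution time that has been saved."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (baseline_s - improved_s) / baseline_s


def speedup(baseline_s: float, improved_s: float) -> float:
    """How many times faster the improved run is with respect to the baseline."""
    if improved_s <= 0:
        raise ValueError("improved_s must be positive")
    return baseline_s / improved_s


if __name__ == "__main__":
    # Reduced test case: nothing enabled (1062 s) against everything enabled (133 s).
    print(f"partial: {100 * time_reduction(1062, 133):.1f}% saved, {speedup(1062, 133):.2f}x")
    # Final test case: optimizations kept constant, 249 s against 59 s.
    print(f"final:   {100 * time_reduction(249, 59):.1f}% saved, {speedup(249, 59):.2f}x")
```

Expressed as a speedup factor rather than as saved time, the same data correspond to roughly 8x for the reduced test case and slightly more than 4x for the final one.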

with very little overhead and no synchronization problems. so it is normal that the overall reduction corresponds to half execution time.7 s 14.9 s 3.7 s calc mudua 12.3 s Having a look at the OpenMP section. there is a 2x factor of speed improvement. this is sensible as the code was almost split in two. As a matter of fact. and so only the small MPI overhead can influence execution.9 s 7.4 s 2. 57 TABLE V FUNCTION RESULTS Function Name Normal OpenMP MPI OpenMP+MPI calc intmudua 24.0 s 10.7 s 4. As for MPI on the other hand.1 s 1. It is interesting to notice that this effect applies perfectly when merging MPI with OpenMP. .8 s 1.0 s 1. communication time is negligible.8 s calc hdmg tet 16.1 s campo effettivo 17. there is an aggressive reduction.5 s 4. thanks to the Infiniband channel used.5 s 9.9 s 2. by a factor of 6-8x: this is a very good result as it means that the code was able to exploit every processor available to the maximum extent.

CHAPTER 6

CONCLUSION

In this thesis, it has been demonstrated that, to achieve the best results, a complete review of the software must be taken into account. Highly serialized software, written following a mathematical model, must be reorganized to allow better parallelization. However, there are technologies that can have a direct impact on performance, in particular OpenMP and MPI. With very little software modification and simple code analysis, it has been possible to introduce a significant improvement in the overall execution time. Furthermore, the standard, clean and stable environment of the GCC suite gave access to important optimization controls that increased the quality of the software where this had not been done with OpenMP or MPI.

For this reason this project shows significant room for improvement. First of all, software analysis must continue in order to extract precise timing information from profiling and to identify the other computationally expensive functions that could receive a significant improvement from the inclusion of OpenMP and MPI directives. Secondly, it could be possible to take advantage of FORTRAN library functions for otherwise long routines, even more so for the high number of small operations repeated several times. In the third place, algorithm optimizations are necessary to obtain high performance. Finally, thanks to the high scalability of cluster systems, it should be fairly easy and very convenient to add new elements that can contribute to the computation.

It would in fact be possible to connect more components to the cluster using an Infiniband switch, at the sole cost of some increased latency. Thanks to the adopted middleware based on open standards, OpenMP and MPI, porting the software to other architectures and expanding its routines to use further nodes of the cluster should not be considered a complex task.


Appendix A

STREAMING SIMD EXTENSIONS

Introduced by Intel in its line of Pentium III processors, the Streaming SIMD Extensions (SSE) allow for SIMD execution. While older processors could only process one data element per instruction, SIMD technology allows instructions to handle multiple data elements, making processing much quicker. In contrast to the preceding MMX technology, SSE registers have an increased width, allowing more bits to be stored and more speed facilities for applications. SSE's use of SIMD technology allows data processing in applications such as 3D graphics to benefit greatly from the availability of extended floating point registers.

Initially eight new 128-bit registers known as XMM0 through XMM7 were added. SSE integer instructions introduced with later extensions would still be able to operate on 64-bit MMX registers, because the new XMM registers require operating system support (this behavior changed only from SSE4 onward).

More precisely, SSE2 adds new mathematical instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers “aliased” on the original floating point register stack. SSE2 enables the programmer to perform SIMD math of virtually any type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to touch the obsolete MMX/FPU registers.
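To make the idea of one instruction handling multiple data elements concrete, the sketch below adds two arrays of doubles using SSE2 compiler intrinsics, two elements per instruction. It is only an illustration and is not taken from the thesis software: the function name add_arrays is hypothetical, and a scalar loop is used when the compiler does not report SSE2 support.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics operating on XMM registers */
#endif

/* Element-wise sum out[i] = x[i] + y[i] for n doubles. */
void add_arrays(const double *x, const double *y, double *out, size_t n)
{
    size_t i = 0;
    assert(x != NULL && y != NULL && out != NULL);
#if defined(__SSE2__)
    /* A 128-bit XMM register holds two doubles, so each _mm_add_pd
     * performs two additions at once. Unaligned loads and stores are used
     * so that no alignment requirement is placed on the arrays. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(out + i, _mm_add_pd(vx, vy));
    }
#endif
    /* Scalar loop: handles the odd trailing element, or the whole array
     * when SSE2 is not available. */
    for (; i < n; i++)
        out[i] = x[i] + y[i];
}
```

In practice an optimizing compiler can often generate equivalent instructions automatically from the plain scalar loop when vectorization is enabled, which is one way compiler optimizations can speed up code without source changes.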

SSE3, SSSE3 and SSE4 are further revisions to the architecture and introduce new operating conditions (column access to registers), new instructions (that can act on 64-bit MMX or 128-bit XMM registers and simplify the implementation of DSP and 3D code) and conversion utilities that avoid pipeline stalls.

In a multi-tasking environment, the Streaming SIMD Extensions require support from the operating system: the SIMD registers must be handled properly by the operating system's context switching code. When the system switches control from one process to another, the old process's SIMD registers must be saved away, and the saved values of the new process's SIMD registers must be loaded into the processor. The Pentium III processor prohibits programs from using the Streaming SIMD Extensions unless the operating system tells the processor at system startup time that it is aware of the SIMD registers, and will manage them properly.

Appendix B

OPENMP TEST PROGRAM

The test program has been designed to simulate some computationally intensive routines of the target software: in the main loop a lot of mathematical functions are executed over a set of arrays, without creating data dependencies between the iterations. Statistics are printed at the beginning and at the end of the program. In order to obtain the total execution time the function gettimeofday() is used; the loop is repeated ten times and the average over the ten runs is printed.

    /* Sweep the number of threads from 1 to 32 and time the kernel for each value. */
    for (u = 1; u <= 32; u++) {
        printf("%d threads\t", u);
        omp_set_num_threads(u);
        totaltime = 0;

        /* Repeat the measurement ten times; the average is printed below. */
        for (t = 0; t < 10; t++) {
            gettimeofday(&timing_start, NULL);

    #pragma omp parallel for \
        shared (a, b, c, d, e, f, g, h, chunk) private (i, t) \
        schedule (guided, chunk)
            for (i = 0; i < N; i++) {
                c[i] = pow(sqrt(a[i] * b[i] / (b[i] + a[i])), 3);
                d[i] = sqrt(c[i] * b[i] / (c[i] + pow(a[i], 4)));
                e[i] = pow(pow(c[i], d[i]), pow(d[i], c[i]));
                f[i] = sqrt(pow(a[i], c[i] + b[i]));
                h[i] = a[i] * b[i] * c[i] * d[i] * e[i] * f[i] * g[i] * h[i];
            }

            gettimeofday(&timing_end, NULL);
            /* Elapsed time of this run, in microseconds. */
            totaltime += (timing_end.tv_sec - timing_start.tv_sec) * 1000000 \
                       + (timing_end.tv_usec - timing_start.tv_usec);
        }
        /* Average time per run, in microseconds. */
        printf("%d\n", totaltime / 10);
    }
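The timing arithmetic of the test program can be isolated in a small helper. The sketch below is not part of the thesis code: the names elapsed_usec and average_usec are hypothetical, and a 64-bit accumulator is assumed so that long runs do not overflow. The microsecond difference may be negative when the seconds field has advanced, which the sum of the two terms handles correctly.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>   /* struct timeval, gettimeofday() (POSIX) */

/* Microseconds elapsed between two gettimeofday() samples. */
long long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Calls work() `runs` times and returns the mean wall-clock time of one
 * call, in microseconds. */
long long average_usec(void (*work)(void), int runs)
{
    struct timeval t0, t1;
    long long total = 0;
    int r;
    assert(work != NULL && runs > 0);
    for (r = 0; r < runs; r++) {
        gettimeofday(&t0, NULL);
        work();
        gettimeofday(&t1, NULL);
        total += elapsed_usec(&t0, &t1);
    }
    return total / runs;
}
```

Averaging over several runs, as the test program does with ten repetitions, reduces the influence of a single run disturbed by other activity on the machine.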

Appendix C

OPENMP FORTRAN REFERENCE

The general OpenMP directive begins with !$OMP, indicating the start of an OpenMP construct; a directive that applies to a block of code is declared with an opening and a closing line, such as:

    !$OMP directive [clause ...]
    [...]
    !$OMP END directive

The first directive must be PARALLEL, which wraps the code section that must be executed in parallel, and it is closed by the corresponding END PARALLEL. The clauses of this directive may be any combination of the following:

IF (condition) parallel execution is activated only when condition is met.

PRIVATE (list) list of private variables.

SHARED (list) list of shared variables.

DEFAULT (type) type of visibility for variables not listed before.

FIRSTPRIVATE (list) list of private variables that are automatically initialized with the value the variable had before the construct.

REDUCTION (operator: list) performs an out-of-order operation of kind operator on the variable list.

COPYIN (list) copies the master thread's values of the variables in list to the other threads.

NUM_THREADS (num) statically sets the number of threads to generate.

Another important OpenMP directive is DO, which specifies that the next loop can be executed in parallel by the thread team. Its clauses are the following:

SCHEDULE (type [, chunk]) describes how iterations of the loop are divided (in chunks) among the threads.

ORDERED performs iterations in order, sequential style.

PRIVATE (list) list of private variables.

FIRSTPRIVATE (list) list of private variables that are automatically initialized.

LASTPRIVATE (list) list of private variables whose value from the last iteration is copied back to the original variable when the loop ends.

SHARED (list) list of shared variables.

REDUCTION (operator | intrinsic : list) performs an out-of-order operation of kind operator (or intrinsic function) on the variable list.

COLLAPSE (n) collapses the n nested loops that follow into a single iteration space.

Other parallelizing directives that don't require any particular clause configuration are:

• SECTIONS: statically splits the code into sections, each of which is assigned to a single thread in the pool.

• WORKSHARE: divides the execution of the enclosed code block into separate units of work.

• TASK: defines an explicit task, which may be executed by the encountering thread, or deferred for execution by any other thread in the team.
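As an illustration of how these clauses combine, the following sketch uses the C spelling of the same construct (#pragma omp parallel for corresponds to !$OMP PARALLEL DO). The function sum_of_squares is a hypothetical example, not a routine of the thesis software. When the file is compiled without OpenMP support the pragma is ignored and the loop simply runs serially, giving the same result.

```c
#include <assert.h>

/* Returns 0^2 + 1^2 + ... + (n-1)^2, computed by the thread team. */
long long sum_of_squares(int n, int chunk)
{
    long long total = 0;   /* each thread accumulates a private copy */
    int i;
    assert(n >= 0 && chunk > 0);

    /* DEFAULT(none) forces every variable to be classified explicitly;
     * REDUCTION combines the per-thread partial sums when the loop ends;
     * SCHEDULE(guided, chunk) hands out blocks of decreasing size, never
     * smaller than chunk iterations (except possibly the last block). */
#pragma omp parallel for default(none) shared(n, chunk) private(i) \
        reduction(+:total) schedule(guided, chunk)
    for (i = 0; i < n; i++) {
        total += (long long)i * i;
    }
    return total;
}
```

Without the REDUCTION clause all threads would update the same shared variable concurrently, which is a data race; the clause is what makes the accumulation safe without explicit synchronization.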

As for synchronization management, it is possible to find the following directives, which don't need any other clause either:

• MASTER: specifies a region of code that is executed only by the master thread.

• CRITICAL: identifies a critical region that only one thread at a time can execute.

• BARRIER: implements a barrier where execution is stopped until all threads are ready to continue.

• ATOMIC: defines a single-instruction critical region, in which memory is accessed atomically by all the threads.

Finally, it is possible to use some OpenMP related functions to further adapt the software to a multiprogrammed system; this set of routines may be used for a variety of applications, such as obtaining information from single threads, configuring the number of threads, getting environment data (like the number of processors), locking variables, timing and so on. For example:

• OMP_SET_NUM_THREADS(): sets the number of threads that must be started.

• OMP_GET_NUM_THREADS(): returns the number of threads of the parallel region.

• OMP_GET_THREAD_NUM(): returns the number identifying a single thread in the pool.

• OMP_GET_THREAD_LIMIT(): returns the maximum number of OpenMP threads available to a program.

• OMP_GET_NUM_PROCS(): returns the number of processors that are available to the program.

• OMP_INIT_LOCK(): initializes a lock on the variable, setting the lock to “unset”.

• OMP_DESTROY_LOCK(): eliminates the lock on the variable.

• OMP_SET_LOCK(): sets the lock on the given variable, waiting until it becomes available.

• OMP_UNSET_LOCK(): unsets the lock on the given variable.

• OMP_TEST_LOCK(): tries to set the lock on the given variable without waiting, and reports whether it succeeded.
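The lock routines listed above are normally used in the order init, set, unset, destroy. The sketch below shows that life cycle through the C bindings of the same routines; the function locked_count is a hypothetical example. The serial stand-ins in the #else branch are an assumption of this sketch, added only so that it also builds when the compiler has no OpenMP support.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (assumption of this sketch): with a single thread a
 * lock never has to wait, so these only record the lock state. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Every iteration adds 1 to a shared counter while holding the lock. */
int locked_count(int n_iter)
{
    omp_lock_t lock;
    int counter = 0;
    int i;
    assert(n_iter >= 0);

    omp_init_lock(&lock);              /* the lock starts out "unset" */
#pragma omp parallel for shared(counter, lock) private(i)
    for (i = 0; i < n_iter; i++) {
        omp_set_lock(&lock);           /* wait until the lock is free */
        counter++;                     /* update protected by the lock */
        omp_unset_lock(&lock);         /* let another thread proceed */
    }
    omp_destroy_lock(&lock);
    return counter;
}
```

For an update as simple as this one, ATOMIC or a REDUCTION clause would usually be cheaper; explicit locks are more appropriate when the protected region is longer or when different data structures need separate locks.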

Appendix D

MPI FORTRAN REFERENCE

MPI routines are added to a standard FORTRAN program by including the mpif.h file. After this, before using any MPI related functions, the MPI layer must be initialized with MPI_INIT(), and it must be closed with MPI_FINALIZE() before ending the program. By using:

    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

the program becomes aware of running in an MPI environment, as the reported functions save the number of the instance of the program in the rank variable, and the total number of instances created in numtasks, respectively.

It is then possible to use point-to-point communication routines, which are present in many variants, like blocking, non-blocking, synchronous and buffered, and they are all described by the following:

    call MPI_SEND(buffer, quantity, type, destination, tag, MPI_COMM_WORLD, ierr)
    call MPI_RECV(buffer, quantity, type, source, tag, MPI_COMM_WORLD, stat, ierr)

The syntax for the equivalent functions is similar:

buffer : represents either the data that has to be sent or the memory location in which it must be saved.

quantity : tells how much data of type is sent in the message.

type : sets one of the high-level MPI data types for the transfer.

destination/source : describes the number of the instance of the program that has to receive or send the buffer.

tag : identifies the message number that must be sent or received.

MPI_COMM_WORLD : is the predefined communicator that includes all the instances of the program.

stat : represents the status of the transfer.

ierr : is the error variable in case communication fails.

MPI also allows for collective communication (a sort of “multicasting”) by means of functions such as:

• MPI_BCAST(): a message is sent to all the nodes.

• MPI_SCATTER(): a message is split and sent to all the nodes.

• MPI_GATHER(): a message is received from all the nodes.

• MPI_ALLTOALL(): each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group; this requires information about the data buffers of both the sender and the receiver.

In order to compile an MPI-enabled program, the compiler is not called directly, but through the wrapper of the MPI distribution, which correctly sets paths and libraries; also for launching executables a special wrapper must be used with the proper syntax.

In case of Open MPI, the MPI implementation selected for this project, the compiler is called mpif90 while the launching wrapper is mpirun; this software must be called specifying the number of instances of the program to run (-np) and the list of hosts that have to execute it (-host). So, for example, in a two-machine cluster environment in which each node has to execute an instance of the program, the correct syntax is

    $ mpirun -np 2 -host host1,host2 program [args]

It is possible to share some environment variables among the nodes with the -x switch; this is required for an OpenMP+MPI system, as the number of threads depends on the value of OMP_NUM_THREADS. The resulting command line instruction is:

    $ mpirun -np 2 -host host1,host2 -x OMP_NUM_THREADS program [args]
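Once each instance knows its rank and the total number of instances, a common way to distribute a loop is to give every instance a contiguous block of iterations. The sketch below shows one such partitioning in plain C; it is an assumed scheme for illustration, not the one used by the thesis software, and the name rank_range is hypothetical. In a real program rank and numtasks would be the values returned by MPI_COMM_RANK and MPI_COMM_SIZE; here they are ordinary integers so that the sketch compiles without an MPI installation.

```c
#include <assert.h>

/* Computes the half-open range [*first, *last) of the n loop iterations
 * assigned to instance `rank` when the work is split as evenly as
 * possible among `numtasks` instances. */
void rank_range(int n, int rank, int numtasks, int *first, int *last)
{
    int base, extra;
    assert(n >= 0 && numtasks > 0 && rank >= 0 && rank < numtasks);
    assert(first != 0 && last != 0);

    base = n / numtasks;    /* iterations every instance receives */
    extra = n % numtasks;   /* leftovers, one each to the lowest ranks */

    *first = rank * base + (rank < extra ? rank : extra);
    *last = *first + base + (rank < extra ? 1 : 0);
}
```

With two instances and n = 10, rank 0 processes iterations 0 to 4 and rank 1 iterations 5 to 9, so each machine handles half of the loop; the partial results would then be collected with a routine such as MPI_GATHER.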


VITA

NAME: Vittorio Giovara

EDUCATION:
B.Sc. equiv., Computer Engineering, Politecnico di Torino, Turin, Italy, 2007
M.Sc. equiv., Computer Engineering, Politecnico di Torino, Turin, Italy, 2009, under the advising of professors Bartolomeo Montrucchio and Carlo Ragusa
Master of Science in Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, Illinois, under the advising of professors Bartolomeo Montrucchio and Zhichun Zhu

HONORS:
PROFICIENCY Certificate in English, Cambridge University, Turin, Italy, 2004
BTP certification, TOBO, XX Winter Olympics, Turin, Italy, 2006
TOP-UIC Fellowship, 2008

PROFESSIONAL:
Project manager for GLE-MiPS (http://gle-mips.googlecode…), a VHDL description for processor architecture, focusing on the educational implementation, Politecnico di Torino, Turin, Italy
Developer of Hedgewars (http://www.hedgewars…), a strategy game, managing the Mac OS X and iPhone
Editor and author for ProjectSymphony (http://www.scribd…), a collection of academic essays and homework reports publicly