Synchronization types
• Barrier
  • Each task performs its work until it reaches the barrier. It then stops, or "blocks".
  • When the last task reaches the barrier, all tasks are synchronized.
• Lock / semaphore
  • The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
  • Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
  • Can be blocking or non-blocking.

Load balancing
• Keep all tasks busy all of the time. Minimize idle time.
• The slowest task will determine the overall performance.

Achieving Load Balancing
• Equally partition the work each task receives
  • For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks.
  • For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks.
  • If a heterogeneous mix of machines with varying performance characteristics is being used, use a performance analysis tool to detect any load imbalances, and adjust the work accordingly.

Achieving Load Balancing
• Use dynamic work assignment
  • Certain classes of problems result in load imbalances even if data is evenly distributed among tasks. For example:
    • Sparse arrays - some tasks will have actual data to work on while others have mostly "zeros".
    • Adaptive grid methods - some tasks may need to refine their mesh while others don’t.
    • N-body simulations - particles may migrate across task domains, requiring more work for some tasks.
[Figure: sparse arrays, adaptive grid, N-body]
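The difference between an even static partition and dynamic work assignment can be illustrated with a small simulation. This is only a sketch: the function names `static_makespan` and `dynamic_makespan` and the workload in `costs` are hypothetical, and real tasks are modeled as nothing more than a cost per work item.

```python
import heapq

def static_makespan(costs, num_tasks):
    """Finish time when items are split into equal contiguous blocks."""
    block = -(-len(costs) // num_tasks)  # ceiling division
    loads = [sum(costs[i:i + block]) for i in range(0, len(costs), block)]
    return max(loads)  # the slowest task determines the overall time

def dynamic_makespan(costs, num_tasks):
    """Finish time when each idle task pulls the next item from a shared queue."""
    finish_times = [0.0] * num_tasks  # min-heap: when each task becomes idle
    for cost in costs:
        earliest = heapq.heappop(finish_times)       # first task to go idle
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

# Hypothetical workload: 16 items, the first four are 10x more expensive,
# similar to a sparse array whose non-zero data sits in one region.
costs = [10.0] * 4 + [1.0] * 12

print(static_makespan(costs, 4))   # 40.0: one task gets all the heavy items
print(dynamic_makespan(costs, 4))  # 13.0: work is spread as tasks become idle
```

Both runs distribute the same 16 items over 4 tasks. Only the assignment policy differs, which is why the evenly sized static blocks still end up imbalanced.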
Achieving Load Balancing
• Use dynamic work assignment
  • When the amount of work each task will perform is intentionally variable, or cannot be predicted, it may be helpful to use a scheduler-task pool approach. As each task finishes its work, it receives a new piece from the work queue.
  • Ultimately, it may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code.

Granularity
• In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
• Fine-grain Parallelism:
  • Relatively small amounts of computational work are done between communication events
  • Low computation to communication ratio
  • Facilitates load balancing
  • Implies high communication overhead and less opportunity for performance enhancement
  • If granularity is too fine, it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
• Coarse-grain Parallelism:
  • Relatively large amounts of computational work are done between communication / synchronization events
  • High computation to communication ratio
  • Implies more opportunity for performance increase
  • Harder to load balance efficiently

Parallel Computing Performance Metrics
• Let T(n,p) be the time to solve a problem of size n using p processors. Then:
Speedup: S(n,p) = T(n,1)/T(n,p)
Efficiency: E(n,p) = S(n,p)/p
Scaled Efficiency: SE(n,p) = T(n,1)/T(pn,p)
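As a minimal sketch, the three definitions above translate directly into code. The function names and the timings in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def speedup(t_serial, t_parallel):
    """S(n,p) = T(n,1) / T(n,p)."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(n,p) = S(n,p) / p."""
    return speedup(t_serial, t_parallel) / p

def scaled_efficiency(t_serial_n, t_parallel_pn):
    """SE(n,p) = T(n,1) / T(pn,p): the problem grows with p."""
    return t_serial_n / t_parallel_pn

# Hypothetical timings in seconds: T(n,1) = 8.0, T(n,4) = 2.5, T(4n,4) = 10.0
s = speedup(8.0, 2.5)             # about 3.2
e = efficiency(8.0, 2.5, 4)       # about 0.8
se = scaled_efficiency(8.0, 10.0) # about 0.8
print(s, e, se)
```

An efficiency of 1.0 would correspond to perfect speedup S(n,p) = p. A scaled efficiency of 1.0 means the p-times larger problem takes no longer on p processors than the original problem takes on one.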
• An algorithm is scalable if there exists C > 0 with SE(n,p) ≥ C for all p.

Scalability
• Two common types of scaling of software are: 1. strong scaling and 2. weak scaling.
• Strong scaling concerns the speedup for a fixed problem size with respect to the number of processors, and is governed by Amdahl’s law.
• Weak scaling concerns the speedup for a scaled problem size with respect to the number of processors, and is governed by Gustafson’s law.
• Scalability is important for parallel computing to be efficient. When using HPC clusters, it is almost always worthwhile to measure the scaling of your jobs. The results of strong and weak scaling tests indicate how well a job's size matches the amount of resources that should be requested for it.

Strong Scalability
• Strong scaling describes how the solution time varies with the number of processors for a fixed total problem size. That is, it shows what happens to the run time as the number of processors is increased while the problem size stays the same.
• The total problem size stays fixed as more processors are added.
• The goal is to run the same problem size faster.
• Perfect scaling means the problem is solved on P processors in 1/P of the serial time.

Weak Scalability
• Weak scaling describes how the solution time varies with the number of processors for a fixed problem size per processor. That is, it shows what happens as the number of processors is increased while the problem is scaled up with them.
• The problem size per processor stays fixed as more processors are added.
• The total problem size is proportional to the number of processors used.
• The goal is to run a larger problem in the same amount of time.
• Perfect scaling means a problem P times larger runs on P processors in the same time as the single-processor run.

Amdahl’s Law
• If P is the parallel fraction of the code, then the maximum speedup = 1/(1-P).
• If none of the code can be parallelized, i.e. P = 0, then speedup = 1, i.e. there is no speedup no matter how many processors are used.
• If all of the code is parallelized, that is P = 1, then the speedup is infinite (in theory).
• If 50% of the code can be parallelized, that is P = 0.5, then the maximum speedup = 2, meaning the code will run at most twice as fast.
• In terms of the number of processors N performing the parallel fraction P of the work, the speedup can be rewritten as: Speedup = 1/((P/N)+S), where N is the number of processors and S = 1 - P is the serial fraction of the code.

Speedup
N         P=0.5   P=0.9   P=0.99
10        1.82    5.26    9.17
100       1.98    9.17    50.25
1000      1.99    9.91    90.99
10000     1.99    9.99    99.02
100000    1.99    9.99    99.90

Gustafson’s law
• Amdahl’s law gives the upper limit of speedup for a problem of fixed size. This limit can look like a bottleneck that discourages the use of parallel computing. For example, if a speedup of 500 on 1000 processors is required, then according to Amdahl’s law the serial fraction cannot exceed about 0.1%.
• Gustafson’s law, on the other hand, assumes that the problem size scales with the amount of available resources. If a problem only requires a small amount of resources, it is not beneficial to use a large amount of resources to carry out the computation. A more reasonable choice is to use small amounts of resources for small problems and larger quantities of resources for big problems.
• Gustafson’s law says that if the parallel part scales linearly with the amount of resources, and the serial part does not increase with the size of the problem, then the scaled speedup is: scaled speedup = S + P × N
• where S, P and N have the same meaning as in Amdahl’s law. With Gustafson’s law the scaled speedup increases linearly with the number of processors, and there is no upper limit for the scaled speedup.

Gustafson’s law
scaled speedup = S + P × N
• According to Gustafson’s law, for S = 0.05 and P = 0.95 the scaled speedup grows without bound as the number of processors goes to infinity. Realistically, with N = 1000 the scaled speedup is about 950 (0.05 + 0.95 × 1000 = 950.05).

Strong Scaling Example
Table 1: Strong scaling for Julia set generator code

height   width   threads   time (sec)   speedup
10000    2000    1         3.932        1.00
10000    2000    2         2.006        1.96
10000    2000    4         1.088        3.61
10000    2000    8         0.613        6.41
10000    2000    12        0.441        8.94
10000    2000    16        0.352        11.23
10000    2000    24        0.262        15.01
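The two laws and the timings in Table 1 can be checked numerically. The sketch below evaluates both formulas as stated earlier and then estimates the serial fraction S from the Table 1 timings with a simple least-squares fit of T(N)/T(1) = S + (1 - S)/N. The function names are hypothetical, and the fit is only one possible procedure: the source does not say how its fitted curve was obtained.

```python
def amdahl_speedup(p, n):
    """Amdahl: speedup = 1 / (P/N + S), with S = 1 - P."""
    return 1.0 / (p / n + (1.0 - p))

def gustafson_scaled_speedup(s, n):
    """Gustafson: scaled speedup = S + P * N, with P = 1 - S."""
    return s + (1.0 - s) * n

def fit_serial_fraction(threads, times):
    """Least-squares estimate of S in T(N)/T(1) = S + (1 - S)/N.

    Assumes the first entry is the single-thread run.
    """
    t1 = times[0]
    num = den = 0.0
    for n, t in zip(threads, times):
        x = 1.0 - 1.0 / n      # coefficient multiplying S
        y = t / t1 - 1.0 / n   # measured excess over the ideal 1/N
        num += x * y
        den += x * x
    return num / den

# Timings copied from Table 1
threads = [1, 2, 4, 8, 12, 16, 24]
times = [3.932, 2.006, 1.088, 0.613, 0.441, 0.352, 0.262]

print(round(amdahl_speedup(0.9, 10), 2))       # 5.26, as in the speedup table
print(gustafson_scaled_speedup(0.05, 1000))    # about 950.05
print(fit_serial_fraction(threads, times))     # estimated serial fraction
```

With this procedure the estimated serial fraction for Table 1 comes out at a few percent, so Amdahl's law predicts that the speedup will level off well below the thread count as more threads are added.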
Problem size is fixed; the number of threads is increased.
Figure 1: Plot of strong scaling for Julia set generator code
The dashed line shows the fitted curve based on Amdahl’s law.

Weak Scaling Example
Table 2: Weak scaling for Julia set generator code

height   width   threads   time (sec)   scaled speedup
10000    2000    1         3.940        1.00
20000    2000    2         3.874        2.03
40000    2000    4         3.977        3.96
80000    2000    8         4.258        7.40
120000   2000    12        4.335        10.91
160000   2000    16        4.324        14.58
240000   2000    24        4.378        21.60
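The last column of Table 2 appears to be the thread count multiplied by the ratio of the single-thread time to the N-thread time. The sketch below recomputes it from the timings under that assumption. The function names are hypothetical.

```python
# Timings copied from Table 2
threads = [1, 2, 4, 8, 12, 16, 24]
times = [3.940, 3.874, 3.977, 4.258, 4.335, 4.324, 4.378]

def weak_scaling_efficiency(times):
    """T(1) / T(N) for a fixed problem size per thread."""
    t1 = times[0]
    return [t1 / t for t in times]

def scaled_speedup(threads, times):
    """N * T(1) / T(N): how much more work is done per unit time."""
    return [n * e for n, e in zip(threads, weak_scaling_efficiency(times))]

for n, s in zip(threads, scaled_speedup(threads, times)):
    print(n, round(s, 2))
```

The efficiency stays close to 1 because the run time grows only slightly while the problem height grows in proportion to the thread count.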
Problem size is not fixed; it grows as the number of threads is increased.
Figure 2: Plot of weak scaling for Julia set generator code
The dashed line shows the fitted curve based on Gustafson’s law.

Inter-task Communication Factors
• Communication overhead
  • Inter-task communication virtually always implies overhead.
  • Machine cycles and resources that could be used for computation are instead used to package and transmit data.
  • Communications frequently require some type of synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work.
  • Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems.
• Latency vs. Bandwidth
  • Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed in microseconds.
  • Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed in megabytes/sec or gigabytes/sec.
  • Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth.

Inter-task Communication Factors
• Visibility of communications
  • With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer.
  • With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures. The programmer may not even be able to know exactly how inter-task communications are being accomplished.
• Synchronous vs. asynchronous communications
  • Synchronous communications require some type of "handshaking" between tasks that are sharing data. This can be explicitly structured in code by the programmer, or it may happen at a lower level unknown to the programmer.
  • Synchronous communications are often referred to as blocking communications since other work must wait until the communications have completed.
  • Asynchronous communications allow tasks to transfer data independently from one another. For example, task 1 can prepare and send a message to task 2, and then immediately begin doing other work. When task 2 actually receives the data doesn't matter.
  • Asynchronous communications are often referred to as non-blocking communications since other work can be done while the communications are taking place.
  • Interleaving computation with communication is the single greatest benefit of using asynchronous communications.

Inter-task Communication Factors
• Scope of communications
  • Knowing which tasks must communicate with each other is critical during the design stage of a parallel code. Both of the scopes described below can be implemented synchronously or asynchronously.
  • Point-to-point - involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer.
  • Collective - involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective.

[Figure: common variations of collective communication]

Performance Model for distributed memory
• Time for n floating point operations: t_comp = f × n, where 1/f is the floating-point rate (e.g. in Mflops)
• Time for communicating n words: t_comm = α + β × n, where α is the latency and 1/β is the bandwidth
• Total time: t_comp + t_comm
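The performance model can be written out as a short sketch, which also shows why packaging many small messages into one larger message reduces the latency cost. The function names are hypothetical, and the values of `alpha` and `beta` are illustrative assumptions rather than measurements of any real network.

```python
def comp_time(n_flops, f):
    """t_comp = f * n, where f is the time per floating point operation."""
    return f * n_flops

def comm_time(n_words, alpha, beta):
    """t_comm = alpha + beta * n: latency plus per-word transfer time."""
    return alpha + beta * n_words

def total_time(n_flops, n_words, f, alpha, beta):
    """Total time = t_comp + t_comm."""
    return comp_time(n_flops, f) + comm_time(n_words, alpha, beta)

# Assumed values: 1 microsecond latency, 1 nanosecond per word
alpha, beta = 1e-6, 1e-9

many_small = 1000 * comm_time(1, alpha, beta)  # 1000 one-word messages
one_large = comm_time(1000, alpha, beta)       # one 1000-word message

print(many_small)  # about 1.0e-3 s: the latency term is paid 1000 times
print(one_large)   # about 2.0e-6 s: the latency term is paid once
```

Under these assumed values the same 1000 words cost roughly 500 times more when sent one word at a time, because the α term dominates each small message.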