
Evaluating Threading Building Blocks Pipelines

Sunu Antony Joseph

Master of Science
Computer Science
School of Informatics
University of Edinburgh
2010

Abstract
Parallel programming is a necessity in the multi-core era. Many parallel programming languages and libraries have been developed with the aim of providing higher levels of abstraction, allowing programmers to focus on algorithms and data structures rather than the complexity of the machines they work on; this abundance makes it difficult for programmers to choose the programming environment best suited to their application development. Unlike for serial programming languages, there are very few evaluations of parallel languages or libraries that can help programmers make the right choice. In this report we evaluate the Intel Threading Building Blocks library, a C++ library that supports scalable parallel programming. The evaluation focuses on pipeline applications implemented using the filter and pipeline classes provided by the library. Various features of the library that help during pipeline application development are evaluated. Different applications are developed using the library and assessed in terms of usability and expressibility, in comparison to POSIX thread implementations of the same applications. Performance evaluations of these applications are also carried out to understand the benefits Threading Building Blocks has over the POSIX thread implementations. Finally, we provide a guide to help future programmers decide which programming library best suits their pipeline application development, depending on their needs.


Acknowledgements
First, I would like to thank my supervisor, Murray Cole, for his guidance and help throughout this project and mostly for the invaluable support in difficult times of the project period. I would also like to thank my family and friends for standing always by me in every choice I make.


Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Sunu Antony Joseph)

To my parents and grandparents.

Table of Contents

1 Introduction
  1.1 Related Work
  1.2 Goals
  1.3 Motivation
  1.4 Thesis Outline
2 Background
  2.1 Parallel Programming Languages
  2.2 Task and Data Parallelism
  2.3 Parallel Design Patterns
    2.3.1 Pipeline Pattern
    2.3.2 Pipeline using both data and task parallelism
  2.4 Intel Threading Building Blocks
    2.4.1 Threading Building Blocks Pipeline
  2.5 POSIX Threads
3 Issues and Methodology
  3.1 Selection of Pipeline Applications
    3.1.1 Fast Fourier Transform
    3.1.2 Bitonic sorting network
    3.1.3 Filter bank for multi-rate signal processing
  3.2 Execution modes of the filters/stages
  3.3 Setting the number of threads to run the application
  3.4 Setting an upper limit on the number of tokens in flight
  3.5 Nested parallelism
  3.6 Usability, Performance and Expressibility
4 Design and Implementation
  4.1 Bitonic Sorting Network
    4.1.1 Application Design
    4.1.2 Implementation
  4.2 Fast Fourier Transform Kernel
    4.2.1 Application Design
    4.2.2 Implementation
  4.3 Filter bank for multi-rate signal processing
    4.3.1 Application Design
    4.3.2 Implementation
5 Evaluation and Results
  5.1 Overview
  5.2 Bitonic Sorting Network
    5.2.1 Usability
    5.2.2 Expressibility
    5.2.3 Performance
  5.3 Fast Fourier Transform Kernel
    5.3.1 Usability
    5.3.2 Expressibility
    5.3.3 Performance
  5.4 Filter bank for multi-rate signal processing
    5.4.1 Usability
    5.4.2 Expressibility
    5.4.3 Performance
  5.5 Feature 1: Execution modes of the filters/stages
    5.5.1 Parallel Filters
    5.5.2 Serial out of order and serial in order Filters
  5.6 Feature 2: Setting the number of threads to run the application
  5.7 Feature 3: Setting an upper limit on the number of tokens in flight
  5.8 Feature 4: Nested parallelism
6 Guide to the Programmers
  6.1 Overview
  6.2 Experience of the programmer
  6.3 Application Development Time
  6.4 Scalability
  6.5 Load Balancing
  6.6 Usability
  6.7 Expressibility
7 Future Work and Conclusions
  7.1 Overview
  7.2 Future Work
  7.3 Conclusions
Bibliography

List of Figures

1.1 Parallel Programming Environments [2]
2.1 Data Parallelism
2.2 Task Parallelism
2.3 Linear Pipeline Pattern
2.4 Non-Linear Pipeline Pattern
2.5 Flow of tokens through the stages in a pipeline along the timeline
2.6 Using the Hybrid approach. Multiple workers working on multiple data in stage 2
4.1 Structure of the Fast Fourier Transform Kernel Pipeline
4.2 Structure of the Fast Fourier Transform Kernel Pipeline implemented in TBB
5.1 Performance of the Bitonic Sorting application (TBB) on machines with different number of cores
5.2 Performance of the Bitonic Sorting application (pthread) on machines with different number of cores
5.3 Performance of the Bitonic Sorting application (pthread) with and without Locks
5.4 Performance of the Fast Fourier Transform Kernel application (TBB) on machines with different number of cores
5.5 Performance of the Fast Fourier Transform Kernel application (pthread) on machines with different number of cores
5.6 Performance of the Filter Bank application (TBB) on machines with different number of cores
5.7 Performance of the Filter Bank application (pthread) on machines with different number of cores
5.8 Performance of the Filter Bank application (TBB) for different modes of operation of the filters
5.9 Latency difference for linear and non-linear implementation assuming equal execution times of the pipeline stages
5.10 Performance of the Filter Bank application with stages running with data parallelism
5.11 Performance of Bitonic Sorting Network varying the number of threads in execution
5.12 Performance of Fast Fourier Transform Kernel varying the number of threads in execution
5.13 Performance of Filter Bank varying the number of threads in execution
5.14 Performance of the FFT Kernel for different input sizes and number of cores of the machine
5.15 Performance of the Filter Bank application for different input sizes and number of cores of the machine
5.16 Performance of Bitonic Sorting Network varying the limit on the number of tokens in flight
5.17 Performance of Fast Fourier Transform varying the limit on the number of tokens in flight
5.18 Performance of Filter Bank varying the limit on the number of tokens in flight
5.19 Performance of pthread applications varying the limit on the number of tokens in flight

List of Tables

4.1 Applications selected for the evaluation

Chapter 1

Introduction

Programmers have got used to serial programming and have come to expect the programs they develop to run faster on each new generation of processors coming out in the market. But those days are over: the next generation of chips will have more processors, with each individual processor no faster than the previous year's model [8]. The clock frequencies of chips are no longer increasing, making parallelism the only way to improve the speed of the computer.

In sequential programming languages like conventional C/C++, it is assumed that a set of instructions is executed sequentially by a single processor, whereas a parallel programming language assumes that simultaneous streams of instructions are fed to multiple processors. Parallel programs execute non-deterministically and asynchronously, with no total order of events, so they are hard to test and debug. The ultimate challenge of parallel programming languages is to overcome these issues and help developers build software with the same ease as with serial programming languages.

Parallel computer systems are useless without parallel software to utilise their full potential. The idea of converting serial programs to run in parallel can be limiting, since the resulting performance may be much lower than that of the best parallel algorithms. Multithreading aims at exploiting the full potential of multi-core processors; the transition to multithreaded applications is inevitable, and leveraging the present multithreading libraries can help developers in threading their applications in many ways. Parallel programming languages or libraries should help software developers write parallel programs that work reliably and give good performance, and the programs developed should also scale with the addition of more processors to the hardware system.

Several parallel programming languages, libraries and environments have been developed to ease the task of writing programs for multiprocessors, and in the past two decades many parallel programming environments appeared. Proponents of each approach often point out various language features that are designed to provide the programmer with a simple programming interface. But why, then, are programmers hesitant to go parallel? For large codes, the cost of parallel software development can easily surpass that of the hardware on which the code is intended to run. As a result, users will often choose a particular multiprocessor platform based not only on absolute performance but also on the ease with which the multiprocessor may be programmed [18]. So the next question is: why have these parallel programming languages not been so productive? Why does only a small fraction of programmers write parallel code? This could be because of the hassle of designing, developing, debugging, testing and maintaining large parallel codes.

Figure 1.1: Parallel Programming Environments [2]

One of the most promising techniques to make parallel programming available to general users is the use of parallel programming patterns. Parallel programming patterns give software developers a language to describe the architecture of parallel software. All design patterns are structured descriptions of high-quality solutions to recurring problems, and the use of design patterns promotes faster development of structurally correct parallel programs. An interesting parallel programming framework that provides templates for the common patterns in parallel object-oriented design is Threading Building Blocks (TBB).

Intel Threading Building Blocks (www.threadingbuildingblocks.org) is a C++ library built on the notion of separating logical task patterns from physical threads and delegating task scheduling to the multi-core system [2]. Threading Building Blocks uses templates for common parallel patterns. A common parallel programming pattern is the pipeline pattern: functional pipeline parallelism is very well suited to, and used by, many emerging applications, such as streaming and Recognition, Mining and Synthesis (RMS) workloads [12].

In this report we evaluate the pipeline template provided by the Intel Threading Building Blocks library. We evaluate the pipeline class in terms of its usability, expressibility and performance. The evaluation is done in comparison to the conventional POSIX thread library (standards.ieee.org/regauth/posix/). The main intention of the project is to provide a guide to future programmers about Intel Threading Building Blocks pipelines, and also to provide a comparative analysis of the TBB pipelines with the corresponding pipeline implementations using POSIX threads. This comparative evaluation is based on many experiments that we conducted on both parallel programming libraries. We also implemented, in pthreads, various features that the Threading Building Blocks library provides, to understand how much easier the Threading Building Blocks library makes the programmer's job during pipeline application development.

1.1 Related Work

The growth in commercial and academic interest in parallel systems has seen an increase in the number of parallel programming languages and libraries, and the development of usable and efficient parallel programming systems has received much attention from across the computing and research community. Szafron and Schaeffer in their paper [17] evaluate parallel programming systems, focusing on three primary issues: performance, applicability and usability. A controlled experiment was conducted in which half of the graduate students in a parallel/distributed computing class solved a problem using the Enterprise parallel programming system [16], while the rest used a parallel programming system consisting of a PVM [5]-like library of message passing calls (NMP [9]).

They performed an objective evaluation during system development that gave them valuable feedback on the programming model, completeness of the programming environment, development time, ease of use and learning curves. They collected statistics such as number of lines of code, number of edits, compiles, program runs and login hours to measure the usability of the parallel programming systems. They performed two experiments: in the first they measured the ease with which novices could learn the parallel programming systems and produce correct, but not necessarily efficient, programs; in the second they measured the productivity of the systems in the hands of an expert. With novices, the primary aim was to measure how quickly they could learn the system and produce correct programs; with experts, the primary aim was to know the time it takes to produce a correct program that achieves a specified level of performance on a given machine.

Another work analysing the usability of the Enterprise parallel programming system was done by Wilson, Schaeffer and Szafron [22]. Another notable work is by VanderWiel, Nathanson and Lilja [18], where they evaluate the relative ease of use of parallel programming languages. They borrow techniques from the software engineering field to quantify the complexity of three prominent programming models: shared memory, message passing and High Performance Fortran. They use McCabe's Cyclomatic program complexity [10] and non-commented source code lines [6] to quantify the relative complexity of the several parallel programming languages, and they conclude by saying that traditional software complexity metrics are effective indicators of the relative complexity of parallel programming languages or libraries. Before this work there was little comparable work done on parallel programming systems, other than by Rao, Segall and Vrsalovic [15], where they developed a level of abstraction called the implementation machine. The implementation machine layer provides the user with a set of commonly used parallel programming paradigms, and they analyse how the implementation machine template helps the user in implementing the chosen implementation machine efficiently on the underlying physical machine.

1.2 Goals

The primary goal of the project is to evaluate Intel Threading Building Blocks pipelines and to understand whether the library is best for pipeline application development in terms of its performance, expressibility and usability. We need to objectively compare parallel programming languages for the pipeline applications.

This can be done by understanding the needs of pipeline application developers during their application development and using those as the criteria to evaluate Intel Threading Building Blocks pipelines. The evaluation of the Intel Threading Building Blocks pipeline class is done in comparison to conventional POSIX thread implementations. The evaluation includes understanding the usability of Intel Threading Building Blocks, which is one of the factors that attract developers to the library in the complex world of parallel programming. Many features provided by the library are put to the test to understand how helpful they are to a pipeline application developer. Further, we look into the expressibility of the library by understanding how suitable it is for different pipeline applications, which will help future developers understand how the library adapts to the variety of pipeline patterns that are commonly used for pipeline application development. Finally, we evaluate the library in terms of scalability and performance of the pipeline applications developed using it. This will help pipeline application developers understand the performance drawbacks or benefits of using Intel TBB for their pipeline application development.

1.3 Motivation

There have been many usability experiments conducted for serial programming languages but very few for parallel languages or libraries. Many parallel programming languages are developed with the aim of simplifying the task of writing parallel programs that run on multi-core machines, yet there is very little data about the complexity of the different parallel programming languages. Such experiments are necessary to help narrow the gap between what parallel programmers want and what current parallel programming systems provide [17]. The usability tests have to be done considering a novice parallel programmer, so that at this time, when software developers are making the transition to parallel programming, this information will help them make the right choice during their pipeline application development. There are many parallel programming languages/libraries, and each may have its own pros and cons for the development of different applications. It is necessary to understand these pros and cons and highlight them to developers so that they can take the right decision and select the apt language/library for their application development. Since design patterns are the solutions to commonly seen problems, categorising the pros and cons of different programming languages/libraries on the different design patterns is a good way to provide this information to developers.

Parallel programs are hard to develop, test and debug. Usability is definitely a factor that programmers look for when applications are being developed under fast-approaching deadlines, or when a novice programmer trying to enter the parallel programming world does not have much idea of multi-threading concepts. Hence, the usability of a parallel programming language plays a very important role in its success. Parallel programming languages that are developed to make the job of parallel programming easier may at times have drawbacks in terms of expressibility, performance or scalability of the programs developed. Understanding the languages in terms of these factors is really important for the proper evaluation of the parallel language/library.

1.4 Thesis Outline

The thesis has been divided into 7 chapters, including this chapter, which is the introduction. The remaining chapters are laid out as follows:

* Chapter 2 gives an overview of the basic concepts and terminologies that help in understanding the report.
* Chapter 3 discusses the various features of Threading Building Blocks that help during pipeline application development and the methodology used for the evaluation. It also discusses how we evaluate Threading Building Blocks in terms of usability, performance and expressibility in comparison to the pthread library.
* Chapter 4 focuses on the design and implementation of the various applications that are developed in Threading Building Blocks and the pthread library.
* Chapter 5 discusses the evaluations and the results obtained.
* Chapter 6 is a guide for future programmers that will help them choose between Threading Building Blocks and the pthread library for their pipeline application development, depending on their needs.
* Chapter 7 draws an overall conclusion and suggests future work.

Another point to be noted is that the data parallelism 8 . Intel Threading building blocks is an example of shared memory programming. OpenMP and POSIX Threads are two widely used shared memory APIs.shared memory. We discuss about parallel programming languages.2 Task and Data Parallelism Data parallelism is used when you have large amount of data and you want the same operation to be performed on all the data. This data by data processing can be done in parallel if they have no other dependencies with each other. Shared memory programming languages communicate by manipulating shared memory variables. A simple example of this can be seen in Figure 2. 2.1 Parallel Programming Languages Parallel programming languages and libraries have been developed for programming parallel computers.Chapter 2 Background In this chapter we discuss the basic concepts that help to better understand the discussions and explanations given in the report.1. 2. whereas Message Passing Interface (MPI) is the most widely used messagepassing system API [21]. or shared distributed memory. distributed memory. the pipeline design pattern and about Intel threading building blocks and POSIX threads. These can generally be divided into classes based on the assumptions they make about the underlying memory architecture . Here the tasks can be done concurrently because there are no dependencies between them. Distributed memory uses message passing. different forms of parallelism.

2. Figure 2.1: Data Parallelism Task parallelism is used when you have multiple operations to be performed on a data. Here the multiple tasks perform different independent operations on the same set of data concurrently.Chapter 2.3 Parallel Design Patterns Parallel software usually does not make full utilisation of the underlying parallel hardware.2: Task Parallelism 2. A simple example of this can be seen in Figure 2. In task parallelism there will be multiple tasks working on the same data concurrently. Background 9 is limited to the number of data items you have to process. It is difficult for the programmers to program patterns that utilise the maximum potential of the hardware and most of the parallel programming environments do not focus on the design issues. Figure 2. So programmers need a guide to help them during their application development that would actually enable them to get the maximum parallel .

Parallel design patterns are expert solutions to these commonly occurring problems that achieve this maximum parallel performance. These patterns provide quick and reliable parallel applications.

2.3.1 Pipeline Pattern

The pipeline pattern is a common design pattern that is used when the need is to perform computations over many sets of data. The whole computation can be seen as data flowing through a sequence of stages. A good analogy of this parallel design pattern is a factory assembly line, where each worker is assigned a component of work and all the workers work simultaneously on their assigned tasks. A simple example of a pipeline can be seen in Figure 2.3.

Figure 2.3: Linear Pipeline Pattern

In the example in Figure 2.3, stage 1, stage 2, stage 3 and stage 4 together form the total computation that has to be performed on each set of data. This computation can be viewed as many stages of processing to be performed in a particular order. The input data are fed to stage 1 of the pipeline, where the data are processed one after the other and then passed on to the next stage in the pipeline. When the arrangement is a single straight chain (Figure 2.3), it is called a linear pipeline pattern. Figure 2.4 is an example of a non-linear pipeline pattern: here you can see stages with multiple operations happening concurrently. Non-linear pipelines allow feedback and feed-forward connections, and also allow more than one output stage, which need not be the last stage in the pipeline.

Figure 2.4: Non-Linear Pipeline Pattern

In Figure 2.5 you can see how the sets of data move through the stages as time passes, assuming that the work load is balanced equally among all the stages and that each stage takes one time step to finish its processing on a set of data. The initial four-time-step delay arises because all four stages are not working together initially, and some resources remain idle until all the stages are occupied with useful work. This is referred to as filling the pipeline, or the latency of the pipeline. Thus, a data set that serially takes four time steps to process comes out fully processed at every time step after the first four time steps, when processed using a pipeline. This gives good throughput to the system.

Figure 2.5: Flow of tokens through the stages in a pipeline along the timeline.

The amount of concurrency in the pipeline depends on the number of stages: the more stages, the more concurrency. The pipeline design pattern works best when all of its stages are equally computationally intensive. If the stages vary widely in the amount of work they do, then the slowest stage becomes a bottleneck in the performance of the pipeline. Furthermore, when the computation to be done on a set of data is divided into stages, the work done by a stage should be comparable to the communication overhead between stages: the data tokens need to be transferred between the stages, which introduces an overhead on the work to be done by a stage. This is an example of fine-grained parallelism, because the design has frequent interaction between the stages of the pipeline.

2.3.2 Pipeline using both data and task parallelism

Pipeline is a type of task parallelism where many tasks or computations are applied to a stream of data.

In most cases the pipeline stages will not be doing work of the same computational complexity: some stages may take much longer to perform their task and some less. Mixing data parallelism and task parallelism together gives a good solution to this problem. The computationally intensive stages can be run by multiple threads doing the same work concurrently over different sets of data. As seen in Figure 2.6, this hybrid approach introduces parallelism within each stage of the pipeline and improves the throughput of the computationally intensive stages.

Figure 2.6: Using the hybrid approach: multiple workers working on multiple data in stage 2.

The purely data-parallel way of doing this is to let a single thread perform the work of the entire pipeline, but let different threads work on different data concurrently. This is an example of coarse-grained parallelism, since the interactions between the stages are infrequent.

2.4 Intel Threading Building Blocks

Threading Building Blocks is a library that supports scalable parallel programming using standard C++ code. It does not need any additional language or compiler support and can work on any processor or operating system that has a C++ compiler [2]. Intel Threading Building Blocks implements most of the common iteration patterns using templates, so the user does not have to be a threading expert who knows the details of synchronisation, cache optimisation or load balancing. The most important feature of the library is that you only need to specify the tasks to be performed and nothing about the threads: the library itself does the job of mapping tasks onto threads. Threading Building Blocks supports nested parallelism, which allows larger parallel components to incorporate smaller parallel components. It also allows scalable data-parallel programming.

2.4.1 Threading Building Blocks Pipeline

In Threading Building Blocks the pipeline pattern is implemented using the pipeline and filter classes. A series of filters represents the pipeline structure, and these filters can be configured to execute concurrently on distinct data packets or to process only a single packet at a time. Pipelines in Threading Building Blocks are organised around the notion that the pipeline data represents a greater weight of memory movement costs than the code needed to process it. Rather than "do the work, toss it to the next guy", it is more as if the workers change places while the work stays in place [7].

Threading Building Blocks provides a task scheduler, which is the main engine that drives all the templates. The scheduler maps the tasks that you have created onto physical threads, following a work-stealing scheduling policy. When the scheduler maps tasks to physical threads they are made non-preemptive: a physical thread works on the task to which it is mapped until the task is finished, and it may perform other tasks only when it is waiting on child tasks; when there are no child tasks it may perform tasks created by other physical threads.

2.5 POSIX Threads

A POSIX thread (pthread) is an extension of the already existing process model to include the concept of concurrently running threads. The idea was to take some process resources and make multiple instances of them so that they can run concurrently within a single process; the multiple instances of the resources made are the bare minimum needed for the instances to execute concurrently [11]. The Native POSIX Thread Library is the software that allows the Linux kernel to execute POSIX thread programs efficiently.

Pthreads give the programmer access to low-level details in programming. Programmers see this as a powerful option, where they can tune low-level details to the needs of the application they are developing, but it also means the programmer has to handle many design issues while developing an application. Thread scheduling implementations differ in how threads are scheduled to run; the pthread API provides routines to explicitly set thread scheduling policies and priorities, which may override the default mechanisms.

Chapter 3

Issues and Methodology

In this chapter we discuss the different features of the Intel Threading Building Blocks library that we intend to evaluate and the methodology we use to evaluate them, in terms of usability, performance and expressibility.

The initial phase of the project is to understand pipeline application development in TBB and to find out what features TBB has to offer, so that we can test those features in our evaluation and analyse whether they are really useful during pipeline application development.

3.1 Execution modes of the filters/stages

Several key features of the Threading Building Blocks library are worth noting and testing. As discussed in the earlier chapter, Threading Building Blocks implements pipelines using the pipeline and filter classes. A filter can be made to run parallel, serial in order, or serial out of order. Setting a filter to parallel lets the stage work on distinct tokens concurrently, in a data-parallel way, so that the same operation is performed on different data in parallel. When a filter is set to run serial in order, the stage runs serially and processes each input in the same order as it came into the input filter. Setting a filter to serial out of order still processes one token at a time, but the order of the input data may not be maintained. These three filter modes are implemented to see whether they provide any favourable results, and a performance analysis is done by calculating the speedup of the application. Pthread applications having parallel stages are also designed so that a comparative analysis can be done.


3.2

Setting the number of threads to run the application

Another feature of Threading Building Blocks is the ability to run the pipeline with a manually set number of threads. We need to test whether this facility of manually deciding the number of threads is a good option for the library to provide. If the user decides not to set the number of threads, the library chooses the value itself, usually the same as the number of physical threads/processors in the system. This TBB philosophy [2] of having one thread for each available concurrent execution unit/processor is put under test and checked for efficiency on the different pipeline applications developed. Each application is run with a varying number of threads, which shows how the thread count helps in increasing the performance of the application. We also test whether setting the number of threads manually is more beneficial than letting the library set it automatically, by comparing the results obtained from automatic initialisation of the number of threads with those obtained from manual initialisation. The pthread applications are likewise run with varying numbers of threads to check whether they can give better performance results than their Threading Building Blocks counterparts. Performance is measured in terms of the speedup of the applications.

3.3

Setting an upper limit on the number of tokens in flight

Threading Building Blocks gives you the feature of setting an upper limit on the number of tokens in flight in the pipeline. The number of tokens in flight is the number of data items running through the pipeline, in other words the number of data sets being processed in the pipeline at a particular instant of time. This limit controls the amount of parallelism in the pipeline. For serial-in-order filter stages it has no effect, as the tokens coming in are executed serially in order. But in the case of parallel filter stages, multiple tokens can be processed by a stage in parallel, so if the number of tokens in the pipeline is not kept in check there can be excessive resource utilisation by the stage. There might also be cases where the serial stages that follow cannot keep up with the fast parallel stages before them. The pipeline's input filter stops pushing in tokens once the number of tokens in flight reaches this limit, and will continue only when


the output filter has finished processing elements and the number in flight drops below the limit. Each application is run with a range of token limits, which tells us how the limit affects the performance of the application. Performance here is measured in terms of the speedup of the application. A similar feature is implemented in the pthread applications and analysed.

3.4

Nested parallelism

Threading Building Blocks supports nested parallelism, by which you can nest small parallel components inside large parallel components. This is different from running the stages with data parallelism: with nested parallelism it is possible to run in parallel the different processing for a single token within a stage. In our pipeline implementations we incorporate nested parallelism by using parallel constructs like parallel_for within the stages of the pipeline, and see whether it is useful to the overall implementation. It also helps us understand how nested parallelism differs from the option of running a stage in parallel by setting its filter to parallel mode. A series of performance tests is done to compare concurrently running filters with nested parallelism in the stages; the results tell us which is more efficient for pipeline application development.

3.5

Usability, Performance and Expressibility

The next step was to decide on apt pipeline applications with which the various features provided by the Threading Building Blocks library can be evaluated. Applications of various types are taken, with varying input sizes and varying complexity of the computation they have to perform. Applications with different pipeline patterns are taken into consideration to understand the expressibility of the library in comparison to the pthread version. Implementations of linear and non-linear pipelines need to be compared with the pthread versions in terms of the performance of the applications and how easy it is for the user to make enhancements or changes to the program without actually changing much of its design. Scalability is another factor that needs to be considered: it should be understood whether the same program gives proportional performance when run with more or fewer processors. Usability of the libraries is analysed throughout the software development life cycle by putting ourselves in the shoes of a parallel software programmer who can be


new to parallel programming or can be a threading expert. The pthread applications are developed to perform the same work as the Threading Building Blocks applications, following the same pipeline structure and the same computational complexity of the algorithm. This is done so that there is a fair comparison between the applications developed in the two libraries.

Chapter 4

Design and Implementation

In this chapter we discuss in detail the design and implementation issues met during the development of the various applications in both parallel programming libraries. During this phase we primarily assess the usability and expressibility of the Threading Building Blocks library in comparison to the pthread library. Designing an application is a very important phase in the software development life cycle, and an easy design phase really helps a programmer build applications faster; the design phase of this project helps us understand the pros and cons of the abstraction that the Threading Building Blocks library provides. Implementing these designs is a challenging task, and during the implementation phase the expressibility of the two parallel libraries is assessed, along with how the various features provided by each library help in implementing the intended design of the application. By carefully analysing the design effort a programmer has to put in, given the flexibility and the constraints each programming library provides, we can evaluate the Threading Building Blocks library. As neophytes to TBB we found it very easy to understand the pipeline and filter classes in the library and were quickly able to implement applications with them.

4.1 Selection of Pipeline Applications

For evaluating TBB we need apt applications that bring out the pros and cons of the library. For this we used the StreamIt benchmark suite (http://groups.csail.mit.edu/cag/streamit/). The StreamIt [19] benchmarks are a collection of streaming applications. These applications are developed in the StreamIt language, so they cannot be used directly for our purpose; they need to be coded in Threading Building Blocks and pthreads. The applications are selected such that they vary in their computational complexity, pattern of the pipeline, number of stages in the pipeline and input size. Because in Intel Threading Building Blocks we cannot determine the number of stages in the pipeline at runtime, we could not include applications like the Sieve of Eratosthenes [20], where the number of stages is determined during the run, whereas this was possible in pthreads [4]. This was one of the drawbacks found with Intel Threading Building Blocks during this phase of the project. The final set of applications selected were the Fast Fourier Transform kernel, the Filter bank for multi-rate signal processing and the Bitonic sorting network, as seen in Table 4.1.

Application              | Computational Complexity | No. of stages | Pattern
-------------------------|--------------------------|---------------|-----------
Bitonic Sorting Network  | Low                      | 4             | Linear
Filter Bank              | Average                  | 6             | Linear
Fast Fourier Transform   | High                     | 5             | Non-Linear

Table 4.1: Applications selected for the evaluation.

4.1.1 Fast Fourier Transform

The coarse-grained version of the Fast Fourier Transform kernel was selected. The Fast Fourier Transform is done on a set of n points, which is one of the inputs given to the program; the only requirement of the implementation is that n should be a power of 2 for it to work properly. Another input given to the program is the permuted roots-of-unity look-up table, which is an array of the first n/2 nth roots of unity stored in a permuted bit-reversal order. The implementation done here is a Decimation In Time Fast Fourier Transform with the input array in correct order and the output array in bit-reversed order; details about the Decimation In Time Fast Fourier Transform can be seen at [3]. The implementation was a non-linear pipeline pattern with 5 stages.

4.1.2 Filter bank for multi-rate signal processing

An application that creates a filter bank to perform multi-rate signal processing was selected. On each branch, a delay, filter and down-sample are performed, followed by an up-sample, delay and filter [1]. The coefficients for the sets of filters are created in the top-level initialisation function and passed down through the initialisation functions to the filter objects.

4.1.3 Bitonic sorting network

An application that performs bitonic sort was selected from the StreamIt benchmark suite. The program implements a high-performance sorting network (by definition of a sorting network, the comparison sequence is not data-dependent) which sorts in O(n*log(n)^2) comparisons [1].

4.2 Fast Fourier Transform Kernel

4.2.1 Application Design

The Fast Fourier Transform kernel implementation was a 5-stage pipeline with the structure shown in Figure 4.1. Stage 1 is the input signal generator, which generates the set of n points and stores it in two arrays. Stage 2 generates the bit-reversal-permuted roots-of-unity look-up table, an array of the first n/2 nth roots of unity stored in permuted bit-reversal order. Both of these stages generate the arrays on the run rather than reading from a file, so as to avoid I/O overhead which may overshadow the performance of the pipeline. Stage 3 is the Fast Fourier Transform stage, where the Decimation In Time Fast Fourier Transform with input array in correct order is computed. Stage 4 is where the output array in bit-reversed order is created and passed on to the last stage of the pipeline, where the output is shown to the user. The intended design of the pipeline is as in Figure 4.1.

Figure 4.1: Structure of the Fast Fourier Transform Kernel Pipeline.

4.2.2 Implementation

4.2.2.1 Threading Building Blocks

In the Threading Building Blocks implementation the pipeline was built as in Figure 4.2: a linear pipeline that implements the same logic as the original algorithm.

Figure 4.2: Structure of the Fast Fourier Transform Kernel Pipeline implemented in TBB.

The implementation has a class that represents the data structure passed along the different stages of the pipeline. Its implementation is shown in Listing 4.1.

Listing 4.1: Data structure representing tokens in the Fast Fourier Transform Kernel application

    class dataobj {
    public:
        double *A_re;          /* Real part of the points          */
        double *A_im;          /* Imaginary part of the points     */
        double *W_re;          /* Real part of the roots of unity  */
        double *W_im;          /* Imaginary part of the roots      */
        static int n;

        static dataobj *allocate(int n_) {
            n = n_;
            dataobj *t = (dataobj *) tbb_allocator<char>().allocate(sizeof(dataobj));
            t->A_re = (double *) tbb_allocator<double>().allocate(sizeof(double) * n);
            t->A_im = (double *) tbb_allocator<double>().allocate(sizeof(double) * n);
            t->W_re = (double *) tbb_allocator<double>().allocate(sizeof(double) * n / 2);
            t->W_im = (double *) tbb_allocator<double>().allocate(sizeof(double) * n / 2);
            return t;
        }

        void free() {
            tbb_allocator<double>().deallocate(this->A_re, sizeof(double) * n);
            tbb_allocator<double>().deallocate(this->A_im, sizeof(double) * n);
            tbb_allocator<double>().deallocate(this->W_re, sizeof(double) * n / 2);
            tbb_allocator<double>().deallocate(this->W_im, sizeof(double) * n / 2);
            tbb_allocator<char>().deallocate((char *) this, sizeof(dataobj));
        }
    };

The static allocate function creates an instance of the class, allocating the memory required to perform the Fast Fourier Transform on the n input points, and returns a pointer to the object created. The free function frees the allocated memory when the computation is done at the end of the pipeline and the token is destroyed.

The computation of each stage is written in the overloaded operator()(void*) function of the class representing that stage of the pipeline; each of these classes inherits from the filter class. The pointer that operator()(void*) returns is the pointer to the token that is passed on to the next stage in the pipeline. This imposes the restriction of having to represent all the components of a single token as a single data structure, so that it can be passed along the stages of the Threading Building Blocks pipeline.

Stage 1 generates the input points at run time, exactly as in the algorithm described in the benchmark suite, stores the values in the dataobj structure as the arrays A_re and A_im, and passes it on to stage 2. Stage 2 creates the roots-of-unity look-up table, which is stored in the W_re and W_im arrays. The data structure with arrays A and W is then passed to the Fast Fourier Transform stage, where the values are computed, stored in array A itself, and passed on to the next stage. Stage 4 finds the bit-reversed order of array A and passes it to the last stage to output the values.

4.2.2.2 Pthread

The pthread implementation has the same pipeline structure as the original pipeline taken from the benchmark suite: two input stages joining at stage 3, with stages 4 and 5 following linearly. The overall pipeline is defined as shown in Listing 4.2.

Listing 4.2: Data structure representing a pipe in the Fast Fourier Transform Kernel pthread application

    struct pipe_type {
        pthread_mutex_t mutex;   /* Mutex to protect pipe data */
        stage_t        *head1;   /* First head                 */
        stage_t        *head2;   /* Second head                */
        stage_t        *tail;    /* Final stage                */
        int             stages;  /* Number of stages           */
        int             active;  /* Active data elements       */
    };

Here the mutex variable is used to obtain the lock over the pipeline information variables (stages and active) and protect them during concurrent access: stages is the count of the number of stages and active the count of the number of tokens active in the pipeline. The variables head1 and head2 point to the two heads of the pipeline, and tail points to the last stage.

Since the present pipeline structure has two kinds of stages, the stages are represented by two kinds of structures. The structure representing stages that receive tokens from a single stage and pass tokens to a single stage is shown in Listing 4.3; in the implementation it is used for stages 1, 2, 4 and 5 in Figure 4.1.

Listing 4.3: Data structure representing a stage (type 1) in the Fast Fourier Transform Kernel application

    struct stage_type {
        pthread_mutex_t    mutex;       /* Protect data      */
        pthread_cond_t     avail;       /* Data available    */
        pthread_cond_t     ready;       /* Ready for data    */
        int                data_ready;  /* Data present      */
        double            *A_re;        /* Data to process   */
        double            *A_im;        /* Data to process   */
        double            *W_re;        /* Data to process   */
        double            *W_im;        /* Data to process   */
        pthread_t          thread;      /* Thread for stage  */
        struct stage_type *next;        /* Next stage        */
    };

The mutex variable protects the data in the pipeline stage. The variables avail and ready are condition variables: avail indicates to the pipeline stage that there is data ready for it to consume or process, while ready indicates to the earlier stage that this stage has finished processing its data and is ready to receive new data. The integer flag data_ready indicates to the sending stage that its data has not yet been consumed by the receiving stage. The structure also includes the data items to be processed by the stage; here these are pointers to the memory locations that contain the data. There is also a pthread_t variable for the thread that processes the stage, and the last member, next, points to the next stage in the pipeline structure.

The structure representing the stage that receives tokens from two stages and sends tokens to a single stage is shown in Listing 4.4; it is used to implement stage 3 of the pipeline in Figure 4.1.

Listing 4.4: Data structure representing a stage (type 2) in the Fast Fourier Transform Kernel application

    struct stage_type {
        pthread_mutex_t    mutex;        /* Protect array A     */
        pthread_cond_t     avail;        /* Array A available   */
        pthread_cond_t     ready;        /* Ready for array A   */
        pthread_mutex_t    mutex1;       /* Protect array W     */
        pthread_cond_t     avail1;       /* Array W available   */
        pthread_cond_t     ready1;       /* Ready for array W   */
        int                data_ready;   /* Array A present     */
        int                data_ready1;  /* Array W present     */
        double            *A_re;         /* Data to process     */
        double            *A_im;         /* Data to process     */
        double            *W_re;         /* Data to process     */
        double            *W_im;         /* Data to process     */
        pthread_t          thread;       /* Thread for stage    */
        struct stage_type *next;         /* Next stage          */
    };

This structure has two sets of mutex, avail, ready and data_ready variables, one for each of the two stages that send data to it. By doing so, the granularity of locking is lowered, allowing the two sending stages to work independently of each other. This works because the two stages write into different locations in the data structure.

The passing of tokens in the pthread application is done using the function shown in Listing 4.5; variations of this function are used to send tokens with different contents. Initially the thread tries to obtain the lock to write into the buffer of the next stage. If the next pipeline stage is still processing data, it then waits on the condition variable ready, which tells the thread when the next-stage thread is ready to accept new tokens. After copying the values of the token into the buffer of the next stage, the thread signals the avail condition variable, telling the next-stage thread that a new token is ready to be processed.

Listing 4.5: Function to pass a token to the specified pipe stage

    int pipe_send(stage_t *stage, double *A_re, double *A_im,
                  double *W_re, double *W_im)
    {
        int status;

        status = pthread_mutex_lock(&stage->mutex);
        if (status != 0)
            return status;

        /*
         * If the pipeline stage is processing data, wait for it
         * to be consumed.
         */
        while (stage->data_ready) {
            status = pthread_cond_wait(&stage->ready, &stage->mutex);
            if (status != 0) {
                pthread_mutex_unlock(&stage->mutex);
                return status;
            }
        }

        /*
         * Copy the data to the buffer of the next stage.
         */
        stage->A_re = A_re;
        stage->A_im = A_im;
        stage->W_re = W_re;
        stage->W_im = W_im;
        stage->data_ready = 1;

        status = pthread_cond_signal(&stage->avail);
        if (status != 0) {
            pthread_mutex_unlock(&stage->mutex);
            return status;
        }
        status = pthread_mutex_unlock(&stage->mutex);
        return status;
    }

Stage 1 generates the input points at run time, just as in the algorithm described in the benchmark suite, and sends pointers to the arrays A_re and A_im to stage 3. Stage 2 creates the roots-of-unity look-up table in parallel with stage 1, stores it in the W_re and W_im arrays, and passes the pointers to stage 3. Stage 3, on receiving these values, performs the Fast Fourier Transform and sends the result to stage 4, where the bit-reversed order of the array is created and passed on to the last stage for output.

4.3 Filter bank for multi-rate signal processing

4.3.1 Application Design

The application design is a 6-stage linear pipeline. Stage 1 is the input generation stage, which creates an array of signal values. The input signal is then convoluted with the first filter's coefficient matrix in stage 2; the convolution matrix is created during the initialisation phase of the program. The signal is then down-sampled in stage 3 and up-sampled in stage 4. Stage 5 convolutes the signal with the second filter's coefficient matrix, which is also created during the initialisation phase. Finally, in stage 6, the signal values are added up into an output array until an algorithmically determined number of tokens arrive, after which the values are output.

4.3.2 Implementation

4.3.2.1 Threading Building Blocks

The implementation of the pipeline in Threading Building Blocks has the same structure as the original intended pipeline. Stage 1 generates the input signal, puts it in an array and passes it to stage 2, where it is convoluted with the first filter's coefficient matrix. The convoluted values are stored in an array and passed on to stage 3, where the signal is down-sampled, and then up-sampled in stage 4. The signal is then passed to stage 5, where it is convoluted with the second filter's coefficient matrix. In stage 6 the values are added into an output array until a predetermined number of tokens arrive, and then the values are output. The tokens passed between the stages are arrays, dynamically allocated with the tbb_allocator at each stage of the pipeline: each stage allocates a new array to hold its processed values and passes it on to the next stage.

4.3.2.2 Pthread

The pipeline implemented in pthread is the same as the intended pipeline structure, having 6 stages that perform the same functions as discussed for the Threading Building Blocks version. Since the pattern is a linear pipeline and every stage has a single source of tokens and a single recipient for them, all the stages have the same structure, shown in Listing 4.6.

Listing 4.6: Data structure representing a stage in the Filter bank for multi-rate signal processing application

    struct stage_type {
        pthread_mutex_t    mutex;       /* Protect data      */
        pthread_cond_t     avail;       /* Data available    */
        pthread_cond_t     ready;       /* Ready for data    */
        int                data_ready;  /* Data present      */
        float             *data;        /* Data to process   */
        pthread_t          thread;      /* Thread for stage  */
        struct stage_type *next;        /* Next stage        */
    };

Here the mutex variable protects the data in the stage, and the avail and ready condition variables indicate the availability of data for processing and the readiness of the stage to accept new data. The structure also holds the pointer to the data item, the thread that processes the stage, and the pointer to the next stage in the pipeline. The overall pipeline is defined with the structure shown in Listing 4.7: it has a mutex used to obtain the lock over the pipeline information variables, head and tail pointers to the first and last stages, and the variables stages and active, which maintain the counts of the number of stages and of the tokens in the pipeline.

Listing 4.7: Data structure representing the pipeline in the Filter bank for multi-rate signal processing application

    struct pipe_type {
        pthread_mutex_t mutex;   /* Mutex to protect pipe */
        stage_type     *head;    /* First stage           */
        stage_type     *tail;    /* Last stage            */
        int             stages;  /* Number of stages      */
        int             active;  /* Active data elements  */
    };

The sending of tokens to the next stage is done in the same way as in the Fast Fourier Transform kernel, using a function like the one in Listing 4.5, except for the difference in the data copied to the buffer of the next stage.

The Filter Bank application was then redesigned to implement the stages with data parallelism. This included the addition of a shared-memory data structure through which all the threads working in a stage can access the tokens to be processed; one instance of this structure, shown in Listing 4.8, is shared between all the threads working in a particular stage. The functionality of its components is the same as discussed for Listing 4.6.

Listing 4.8: The shared-memory data structure for the threads working in the same stage

    typedef struct shared_data {
        pthread_mutex_t mutex;       /* Protect data     */
        pthread_cond_t  avail;       /* Data available   */
        pthread_cond_t  ready;       /* Ready for data   */
        int             data_ready;  /* Data present     */
        float          *data;        /* Data to process  */
    } shared_mem;

4.4 Bitonic sorting network

4.4.1 Application Design

The bitonic sorting network application was designed as a 4-stage pipeline: stage 1 for the input generation, stage 2 for the creation of the bitonic sequence from the input values, stage 3 for the sorting of the bitonic sequence, and the final stage to output the sorted values.

and the final stage to output the sorted values. The application is designed to sort many fixed size arrays of numbers, and has been altered from the original benchmark suite design to incorporate varied size inputs. Initially the application was designed to read the values to be sorted from a file and to write the sorted values into an output file, so stage 1 and stage 4 were initially designed to read values from an input file and to write values into an output file respectively. This was later changed due to reasons we discuss in the evaluation section.

4.4.2 Implementation

4.4.2.1 Threading Building Blocks

The implementation follows the intended pipeline design, having 4 pipeline stages in a linear pipeline structure. The data structure to represent tokens was initially a circular buffer of fixed size; this can be seen in Listing 4.9, where buff is the circular buffer of arrays. Each array in the buffer is filled with input values and passed on to the next stage. It has to be ensured in the implementation that the number of tokens in flight is not more than SIZE, which is easily possible in threading building blocks.

In the initial implementation the input was read from a file rather than generated during the execution; the later version of the stage 1 input filter class generates a set of randomly generated numbers and passes them on to the next stages. The class also maintains a count of the number of input tokens generated and stops when a required limit is reached. Passing tokens to the next stages involves only passing pointers to the arrays of data values, which is done by returning the token pointer from the overloaded operator()(void*) function of the classes representing the stages. The stage 2 class implements the logic for the bitonic sequence generation from the array received from the input generation class; here the computation involves only comparing and swapping of values. Stage 3 implements the merge phase of the bitonic sequence and does almost the same amount of computation as stage 2 in terms of the number of comparisons and swaps.

Listing 4.9: The Circular buffer to hold tokens

    class InputFilter : public tbb::filter {
        Buffer buff[SIZE];
        size_t nextBuffer;
    public:
        InputFilter() : filter(serial_in_order), nextBuffer(0) { }
        ~InputFilter() { }
        void Tokenize(const string& str, vector<string>& tokens,
                      const string& delimiters = " ");
        void* operator()(void*);
    };

    void InputFilter::Tokenize(const string& str, vector<string>& tokens,
                               const string& delimiters)
    {
        string::size_type lastPos = str.find_first_not_of(delimiters, 0);
        string::size_type pos = str.find_first_of(delimiters, lastPos);
        while (string::npos != pos || string::npos != lastPos) {
            tokens.push_back(str.substr(lastPos, pos - lastPos));
            lastPos = str.find_first_not_of(delimiters, pos);
            pos = str.find_first_of(delimiters, lastPos);
        }
    }

    void* InputFilter::operator()(void*)
    {
        string line;
        static fstream input_file(InputFileName);
        if (getline(input_file, line)) {
            Buffer& buffer = buff[nextBuffer];
            vector<string> toks;
            Tokenize(line, toks);
            for (int y = 0; y < toks.size(); y++) {
                buffer.array[y] = atoi(toks[y].c_str());
            }
            nextBuffer = (nextBuffer + 1) % SIZE;
            return &buffer;
        } else {
            return NULL;
        }
    }
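Stages 2 and 3 then apply compare-and-swap passes to each buffered array. As a rough, self-contained sketch of the kind of serial kernel that goes inside a stage's operator() (the function name and structure here are ours, not the thesis code; n must be a power of two):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Illustrative serial bitonic sort kernel: the compare-and-swap
// passes that stages 2 and 3 of the pipeline split between them.
// 'n' must be a power of two.
void bitonic_sort(int* a, std::size_t n) {
    for (std::size_t k = 2; k <= n; k *= 2) {        // build bitonic runs
        for (std::size_t j = k / 2; j > 0; j /= 2) { // merge passes
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {
                    bool ascending = ((i & k) == 0);
                    if (ascending ? (a[i] > a[partner])
                                  : (a[i] < a[partner]))
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

In the pipeline version, the outer loop levels would be divided between the sequence-building and merging stages; only pointers to the arrays move between stages.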

4.4.2.2 Pthread

The pthread implementation has the same intended linear pipeline design with 4 stages. The structure of a pipeline stage is similar to Listing 4.6: it contains the pointer data that points to the array, and another pointer next pointing to the next stage in the pipeline, which is used when sending data. It also contains two condition variables used to synchronise the sending and receiving of data and a mutex variable to protect the data in the stage. The main structure of the pipeline is similar to Listing 4.7: it contains pointers to the head and tail stages of the pipeline and a mutex variable to protect the pipe information data. The sending of tokens to the next stage was done using the function as in Listing 4.5, except for the difference in the data that is copied to the buffer of the next stage.
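The hand-off that such a stage structure supports can be sketched with one mutex and two condition variables. This is our own minimal illustration (names like Mailbox, send_token and receive_token are invented here, not taken from the thesis listings):

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical single-slot mailbox between two adjacent stages, in the
// spirit of the stage structure described above: one mutex, two
// condition variables, and a "data present" flag.
struct Mailbox {
    pthread_mutex_t mutex;
    pthread_cond_t  avail;   // signalled when a token is present
    pthread_cond_t  ready;   // signalled when the slot is free again
    bool   present;
    float* data;
    Mailbox() : present(false), data(nullptr) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&avail, nullptr);
        pthread_cond_init(&ready, nullptr);
    }
};

void send_token(Mailbox* m, float* token) {
    pthread_mutex_lock(&m->mutex);
    while (m->present)                  // wait until consumer took the token
        pthread_cond_wait(&m->ready, &m->mutex);
    m->data = token;
    m->present = true;
    pthread_cond_signal(&m->avail);     // wake the consuming stage
    pthread_mutex_unlock(&m->mutex);
}

float* receive_token(Mailbox* m) {
    pthread_mutex_lock(&m->mutex);
    while (!m->present)
        pthread_cond_wait(&m->avail, &m->mutex);
    float* token = m->data;
    m->present = false;
    pthread_cond_signal(&m->ready);     // slot is free again
    pthread_mutex_unlock(&m->mutex);
    return token;
}
```

A real stage thread would loop over receive, process, send; the two condition variables prevent both overwriting an unconsumed token and reading an empty slot.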

Chapter 5

Evaluation and Results

In this chapter we discuss the evaluation done on the threading building blocks library. We start with the evaluation and the results obtained for the three pipeline applications implemented; here we discuss the usability and expressibility issues faced during each application's development and also do a performance analysis of the applications. We then discuss the results and evaluation of the different features provided by threading building blocks. The different features in the threading building blocks library are evaluated, and a comparative analysis is done by implementing these features in the pthread applications developed. In this way we evaluate the usability and expressibility of the two parallel programming libraries and also the performance of the applications developed using them.

We ran the various applications on multi-core machines and measured the speedup of the different applications developed. The experiments were run on 16-core and 2-core machines. The 2-core machine had Intel(R) Xeon(TM) 3.20GHz processors with 2 GB RAM and the 16-core machine had Intel(R) Xeon(R) 2.13GHz processors with 63 GB RAM. Most of the experiments were carried out on the 16-core machine; executions done on the 2-core machine are explicitly mentioned.

5.1 Bitonic Sorting Network

The evaluation of threading building blocks started with the bitonic sorting network application. This was challenging in many ways because it was the first pipeline application we were developing using the library. As newcomers to the threading building blocks library we had the usual initial difficulties until we became used to the constructs and the features that the library provided.

5.1.1 Usability

5.1.1.1 Threading Building Blocks

During the initial programming phase the challenging part was to create the right data structure to represent tokens and pass them efficiently across stages. We tried many data structures before we actually decided on one. One of the data structures tried had a large array which was divided into n buffers of data, each representing a token in the pipeline that had to be sorted. The size of the large array was fixed and could incorporate only a fixed number of tokens. The input filter would fill these buffers one after the other and pass them on to the next stage for processing; after all n buffers were filled, it would start again by filling the first buffer. By limiting the number of tokens in flight to n it was ensured that, when the input filter came to the next round of filling up the large array starting from the first buffer, the first buffer from the previous round had already finished processing at the last stage of the pipeline. Making the stages run serially and process the tokens in the same order as created by the input filter also ensured that buffers that had not finished processing would not be overwritten with new values. This implementation worked because we were able to limit the number of tokens in flight using the threading building blocks library, and because we could program the stages to run sequentially and process the tokens in a fixed order.

This implementation was later changed, since there was an intention to experiment with changing the number of tokens in flight in the pipeline, and this design was not best suited for that purpose. But we could not fail to notice that the threading building blocks features of limiting the number of tokens in flight, making stages run sequentially and processing tokens in order easily allowed us to implement a design like this. As for the computational part of performing the bitonic sort, we just had to implement the C++ serial code for each stage and place it in the overloaded operator()(void*) function of the respective classes that represented the stages. As parallel programmers we did not have to bother about low level threading concepts like synchronisation, load balancing or cache efficiency. As soon as we got the right data structure for the tokens passed along the pipeline and implemented the computational task done by each stage, we had a correctly working pipeline without much hassle.

5.1.1.2 POSIX Thread

The bitonic sorting network being our first pipeline application in pthread, we had to get a lot of concepts thorough before we actually started writing the program. We had to understand in detail the use of thread spawning, mutexes and condition variables for synchronising threads, inter-thread communication, and so on. The most important challenge during the application development in pthread was to get the right design for the pipeline, and it was a difficult task because of the amount of implementation detail we needed to handle. We had to fix the structure of the pipeline so that the threads handling each stage correctly send and receive data between them, which included the use of mutexes to protect the critical sections and the use of condition variables for the synchronisation of sending and receiving data. The work to be done by each stage was written in a separate function and had to be assigned to each thread during thread initialisation. To implement a feature like limiting the count of tokens in flight, we had to include a counter in the design that kept the count, protected by mutexes for access to the counter variable. Getting the right design was the most difficult task, but there were also many difficulties during the later development stages. During the first executions of the program we discovered a few errors, and debugging a multi-threaded application was not an easy task: there are many ways in which the program can go wrong, identifying them was really hard, and it was hard to determine whether the errors were due to improper synchronisation of threads or due to wrong computational logic.

5.1.2 Expressibility

5.1.2.1 Threading Building Blocks

As for the expressibility of the threading building blocks library, we had no issue implementing the desired design of the pipeline. We could easily set the input and output stages to work sequentially when the data was read from and written into files, and we could write data into the output file in the same order it was read from the input file, just by passing the serial_in_order keyword to the inherited filter class constructor during the constructor call of the classes that represented the input and output stages. We were able to run the middle stages of the pipeline in parallel (data parallelism) just by passing the keyword parallel to the filter class constructor. When we implemented the design where we had to restrict the number of tokens in flight, we were easily able to do so because the library gave us the feature to set the maximum limit by passing it to the run() function of the pipeline class.

5.1.2.2 POSIX Thread

In the pthread implementation we were able to implement the intended design for the application. We could limit the number of tokens in flight by including a counter in the design that maintained the count, which was not as easy as in the case of threading building blocks.

5.1.3 Performance

The bitonic sorting network application was initially developed with a design that read the values to be sorted from files and then wrote the sorted values into another file. The implementation was working fine, except that when the performance of the application was measured it was noted that the run-time varied greatly between executions. Initially the assumption was that the varied execution times arose because the amount of computation in each stage was very small and was overshadowed by the overhead of the synchronisation of threads done internally by the threading building blocks library. On detailed reading about Intel threading building blocks it was found that it is not ideal for I/O bound applications, as the threading building blocks task scheduler is unfair and non-preemptive [13]. Thus the design of the application had to be changed for further tests: the I/O stages were replaced with a stage that generated the input during execution, and the application was ready to be evaluated. With this change the threading building blocks application was easily scaled to machines with different numbers of cores without the need to change anything in the code, and the results obtained are as in Figure 5.1.

The initial design of the pthread application, with stages that read and write data into files, gave steady results for execution times, unlike the threading building blocks version. Though the results were stable, the execution time of the application was very high. The speedup of the application was calculated in its best configuration and was found to be less than 1. This could be because the computational complexity of the application is low and is overshadowed by the synchronisation mechanism implemented in the program. Later, the pthread version of the application was also redesigned with input generation stages that removed the overhead due to I/O operations in the pipeline.

Figure 5.1: Performance of the Bitonic Sorting application (TBB) on machines with different number of cores.

Figure 5.2: Performance of the Bitonic Sorting application (pthread) on machines with different number of cores.

The pthread application was also evaluated on machines with different numbers of cores without any change in the code; this shows the contrast with threading building blocks, where a newcomer programmer can achieve good performance dependent on the machine without much effort. The results shown in Figure 5.2 were obtained on evaluation. On the performance analysis of the application it was noted that the pthread application took a very large amount of time to execute compared to the threading building blocks version, as can be seen in Figure 5.2. The bad performance was assumed to be caused by the synchronisation mechanism implemented in the program and the stages being less computationally intensive.
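One way to test that hypothesis is to make the synchronisation switchable and compare the two run-times. A toy sketch of the idea (our own illustration, not the thesis code; under real concurrency the unlocked variant may of course compute wrong results, which is exactly the trade-off the experiment accepted):

```cpp
#include <cassert>
#include <pthread.h>

// Toy illustration of a "with and without locks" timing experiment:
// the same work loop, with the synchronisation calls made switchable
// so that the two run-times can be compared. The loop body stands in
// for per-token stage work.
static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;

long accumulate(const int* data, int n, bool use_locks) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        if (use_locks) pthread_mutex_lock(&guard);
        sum += data[i];               // stand-in for stage computation
        if (use_locks) pthread_mutex_unlock(&guard);
    }
    return sum;
}
```

Timing accumulate in both modes isolates the cost of the locking itself, which is what the thesis experiment measured at the scale of the whole pipeline.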

It was understood that, the application being less computationally intensive, most of the threads were idle most of the time waiting on locks to be released. To confirm this, a test was done by removing all the thread synchronisation mechanisms in the application and calculating its run-time. Though the application then gave incorrect output values, it could be understood from the run-time whether most of the time was being taken by the synchronisation of threads. The results obtained are shown in Figure 5.3; there is a drastic reduction in the execution time of the application when run without locks.

Figure 5.3: Performance of the Bitonic Sorting application (pthread) with and without Locks.

5.2 Filter bank for multi-rate signal processing

The Filter Bank application was the second application that was developed. With the experience attained from the bitonic sorting network application we could immediately start work on the second application, because we had familiarised ourselves with both parallel programming libraries and had a basic idea of how to design and implement a pipeline application in both pthread and threading building blocks. The pipeline has a longer chain, with 6 stages in a linear pipeline structure. The Filter Bank application is more computationally intensive than the bitonic sort application, having to work on large signal arrays and large filter coefficient matrices.

5.2.1 Usability

5.2.1.1 Threading Building Blocks

The development of the bitonic sorting network made us familiar with threading building blocks, due to which the Filter Bank application was developed much faster. We just had to paste the computation for each stage into the operator() function of the appropriate classes; creating the right data structure for the token movement in the pipeline was the only challenge in the implementation. As parallel programmers we wanted to make our pipeline application run faster, and we were easily able to identify the bottleneck stages by trying the serial_in_order, serial_out_of_order and parallel options in the filter classes and measuring the resulting speedup. Using these options we could very easily tweak the application for the best performance. Since we had found the bottleneck stages in the pipeline, the next step was to run those stages with data parallelism.

5.2.1.2 POSIX Thread

Similar to the case of threading building blocks, the bitonic sort application implemented in pthread gave us a quick start, because we had already figured out a generic structure for the pipeline. With a few application dependent changes in the design we were immediately ready to start with the implementation. The reuse of the design made development easy for us; it was not the same in the case of threading building blocks, where many of the design issues were abstracted by the library and the only notable challenge was to get the right data structure for the tokens, which we managed with moderate ease for the Filter Bank application. Getting the right design had been the toughest part in the bitonic sorting network.

With the bottleneck stages easily identified in the threading building blocks application we were easily able to tweak that application for performance, but in the case of the pthread application we had to measure the single-token execution time in each stage to understand which were the bottleneck stages in the pipeline. This was comparatively a tougher task than what we had to do for the threading building blocks application. Implementing the stages to run in parallel needed many changes in the already implemented design of the pthread application, because of the cases where data had to be sent to many recipients and received from many senders. This redesign, though built on the existing design, had many challenges: issues like synchronisation of threads and efficiency had to be considered for the right design, and a lot of time had to be spent on the redesign, testing and debugging of the application, which was even harder than in the case of sequential stages. In threading building blocks there was no need to redesign the application, as we just had to pass the argument parallel to the filter class constructor to make the stages run in parallel.

5.2.2 Expressibility

5.2.2.1 Threading Building Blocks

In terms of expressibility the TBB library provided us with all the features needed for the implementation of the intended design of the application. It provided features with which we could find the bottlenecks in the application and run those stages in parallel with great ease, thereby expressing both task and data parallelism. Collapsing stages, if needed, was also very easy in threading building blocks: we just had to paste the computation of the collapsed stages into one single class, without much change in the design of the application.

5.2.2.2 POSIX Thread

The pthread library provided the required flexibility to express the intended design for the pipeline applications. The bottleneck stages were identified and we were also able to run these stages in data parallel to make the implementation efficient. Changes like collapsing of stages were also possible. Both of these were possible in both threading building blocks and pthread without much hassle.

5.2.3 Performance

The Filter Bank application, being computationally intensive and having no I/O operations in it, showed no problems during the performance evaluation. The long chained pipeline application worked perfectly with threading building blocks, giving good speedup. It was also possible to collapse stages for better load balance between the different stages. The threading building blocks application was easily scalable and gave good speedup results even when tested over machines with different numbers of cores, as shown in Figure 5.4.

Figure 5.4: Performance of the Filter Bank application (TBB) on machines with different number of cores.

Figure 5.5: Performance of the Filter Bank application (pthread) on machines with different number of cores.

The pthread application was also able to give good speedup. A pthread application does not scale on its own as in the case of threading building blocks, so to understand how pthread would perform without any change in the code, the application was run on machines with different numbers of cores, which gave the results shown in Figure 5.5. It can be seen that the speedup obtained in the threading building blocks version is much better than in the pthread version, which can be attributed to the scheduler that threading building blocks uses and also to the thread abstractions that the library provides.
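Locating bottlenecks in the pthread versions meant timing a single token through each stage by hand, as described above. A small sketch of that measurement (the stage functions and their costs here are invented purely for illustration):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical per-stage timing: run one token through each stage
// function and report the index of the slowest stage -- the manual
// bottleneck search the pthread version required.
typedef void (*StageFn)(void);

std::size_t slowest_stage(const std::vector<StageFn>& stages) {
    using clock = std::chrono::steady_clock;
    std::size_t worst = 0;
    clock::duration worst_t = clock::duration::zero();
    for (std::size_t s = 0; s < stages.size(); ++s) {
        clock::time_point t0 = clock::now();
        stages[s]();                       // process one token
        clock::duration dt = clock::now() - t0;
        if (dt > worst_t) { worst_t = dt; worst = s; }
    }
    return worst;
}

// Invented stage bodies with very different costs, for demonstration.
inline void stage_light() { for (volatile int i = 0; i < 1000; ++i) {} }
inline void stage_heavy() { for (volatile int i = 0; i < 8000000; ++i) {} }
```

Threading building blocks made the same search a matter of flipping filter modes and re-running, which is the usability gap discussed above.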

5.3 Fast Fourier Transform Kernel

The Fast Fourier Transform kernel was the third application developed to evaluate threading building blocks pipelines. It is a 5 stage pipeline performing a reasonably good amount of computation at each stage. This application was particularly chosen because of the non-linear pipeline pattern required in its implementation.

5.3.1 Usability

5.3.1.1 Threading Building Blocks

After implementing two applications in threading building blocks, the only phase that took some time was deciding on a correct and efficient data structure for the tokens. The Fast Fourier Transform kernel, having a non-linear pipeline pattern, demanded extra attention in the design of the application. Even so, designing it was not really any different, because a non-linear pipeline is implemented as a linear pattern in threading building blocks: we just had to decide on the correct order of the stages in the linear pipeline, as the non-linear pattern was converted to a linear one, and put the computation into the filter classes to implement the pipeline. Since we already had the required algorithm from the benchmark suite, we just had to put the code in at the appropriate places. From the programming point of view it was no different from implementing a linear pipeline, because of the abstractions threading building blocks provided us, so there were no extra usability issues implementing a non-linear pipeline in threading building blocks in comparison to implementing a linear one. We had the application up and working within just a few execution tries; after implementing this application we found pipeline application development with the library extremely fast and trouble free, and the Fast Fourier Transform kernel took only a few hours for us to implement.

5.3.1.2 POSIX Thread

The experience of developing the previous two pthread applications helped in developing the pthread version of the Fast Fourier Transform kernel. Because of the non-linear structure of the pipeline, however, the stages were not all the same, so the stages were represented using different structures incorporating extra measures for thread synchronisation and access to shared resources. Designing the right structure was not as simple a task as in threading building blocks. Appropriate checks had to be done at the combining stages to ensure that data was combined together in the correct order. A lot of time had to be spent testing and implementing the correct design. Many issues like these had to be handled, which made the application development a tougher task than in threading building blocks, where these issues did not come up. In pthread every phase was tougher than in threading building blocks, because pthread programming requires attention to a lot of low level details.

5.3.2 Expressibility

5.3.2.1 Threading Building Blocks

The intended design for the application was a non-linear pipeline, but it was not possible to implement it in threading building blocks because the library does not support non-linear pipeline patterns. The work-around is that the non-linear pipeline has to be converted to a linear pipeline and then implemented with the library. The expressibility of threading building blocks is therefore flawed if the need is to implement a non-linear pipeline.

5.3.2.2 POSIX Thread

The pthread library gives you the flexibility of implementing non-linear pipelines, and the Fast Fourier Transform application was developed with the intended non-linear design using it. One of the good things about the pthread library, because it lets the programmer work at such a low level, is the flexibility it gives the programmer to implement the application the way he needs it, with fewer library related restrictions. Implementing a non-linear pipeline had its difficulties in pthread, but if the application had been implemented as a linear pattern, just as was done for threading building blocks, we could easily have avoided the troubles we went through implementing the non-linear pattern.

5.3.3 Performance

The Fast Fourier Transform kernel application's intended design was a non-linear pipeline implementation, and it was important to understand the performance of the threading building blocks application in comparison to the pthread application, because threading building blocks does not support non-linear pipelines whereas pthread gives the flexibility to implement them.

Even though threading building blocks works around the problem and implements the non-linear pipeline in a linear pattern, it was necessary to understand whether it still gave the performance benefits seen in the other applications. The results are as shown in Figure 5.6. It can be seen that despite the inability of the threading building blocks library to express non-linear pipelines, it gave really good speedup for the application, which also scaled well on machines with different numbers of cores.

Figure 5.6: Performance of the Fast Fourier Transform Kernel application (TBB) on machines with different number of cores.

The non-linear pipeline implemented using the pthread version of the application also gave good speedup when run on machines with different numbers of cores, but not as much as in threading building blocks; the results are shown in Figure 5.7. It is observed that despite the flexibility that pthread offers to implement the non-linear pipeline, the performance results of the pthread application are not better than the threading building blocks results.

In the 5 stage non-linear pipeline, the first two stages concurrently generate the data required to process a single token, which results in a single output set. In the threading building blocks version this was implemented linearly, with stage 2 after stage 1, thereby not concurrently generating the data needed to process a token. But in either case the pipeline outputs results from the last stage at every time interval equal to the execution time of the slowest stage in the pipeline, and this time interval is the same for both the linear and non-linear implementations.

Figure 5.7: Performance of the Fast Fourier Transform Kernel application (pthread) on machines with different number of cores.

The only difference that arises is in the latency of the pipeline, that is, the initial start-up time before the pipeline starts to output data. Figure 5.8 explains how the latency varies for the linear and non-linear implementations of the pipeline, assuming that all stages take an equal amount of time to execute. This small difference in latency does not make a huge difference in the performance of the pipeline most of the time, because the pattern is used to process large amounts of data which take a long time to process; this time is very large compared to the latency advantage the non-linear pipeline provides. But priorities can change depending on the application's needs, and there are many cases where the latency of the pipeline is a crucial factor.

5.4 Feature 1: Execution modes of the filters/stages

The selection of the mode of operation of the filters is one of the most powerful features in the pipeline implementation. The ease with which a programmer can set the way the stages should work definitely facilitates faster programming.

5.4.1 serial_out_of_order and serial_in_order Filters

serial_in_order stages were used when there was a need for a certain operation to be done only by a single thread and the order in which the tokens are processed had to be maintained. serial_out_of_order stages were used in cases where certain operations had to be performed serially but the order of processing of the tokens was not important.
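The latency comparison in Figure 5.8 reduces to simple arithmetic under the equal-stage-time assumption. The following idealised sketch (our own, not from the thesis) makes it concrete: with S stages of time t each, the linear layout first outputs after S·t, while running the two independent front stages concurrently removes one stage from the critical path; the steady-state output interval is t in both layouts.

```cpp
#include <cassert>

// Idealised pipeline timing: S equal stages of time t each.
// Start-up latency is S*t for the linear layout; running the two
// independent front stages concurrently (the non-linear layout)
// saves one stage time. Steady-state output interval is t either way.
struct PipelineTiming {
    int    stages;      // S
    double stage_time;  // t
};

double linear_latency(PipelineTiming p)    { return p.stages * p.stage_time; }
double nonlinear_latency(PipelineTiming p) { return (p.stages - 1) * p.stage_time; }
double output_interval(PipelineTiming p)   { return p.stage_time; }
```

For the 5 stage kernel this saving is one stage time out of five at start-up only, which is why it is dwarfed by the total processing time of a long input stream.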

Figure 5.8: Latency difference for linear and non-linear implementation assuming equal execution times of the pipeline stages.

Running the stages serial_out_of_order rarely gave performance benefits compared to serial_in_order stages, even though serial_in_order stages introduce some delays between the processing of tokens to ensure the tokens are processed in the right order. This might be because the processing time in the middle parallel stages was the same for each and every token, so most tokens reached the serial_in_order stage already in order and there was rarely any delay introduced by the filter to order the token processing. Likewise, there were not many cases observed in any of our implementations where output tokens came out of order when using serial_out_of_order stages, even though there were parallel stages in between them; again this might be because the processing time in the parallel stages was the same for every token.

The Filter Bank pthread application was redesigned to do a comparison test with the serial_in_order and serial_out_of_order filters of the threading building blocks library. Implementing serial_in_order stages required the stages to be run only by a single thread and required additional information to be attached to the tokens in flight, giving the order in which the tokens are to be processed. Ignoring the order of tokens gave the stages serial out-of-order execution. This implementation needed a lot more work compared to that in the threading building blocks library.
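Re-creating serial_in_order by hand, as the pthread redesign required, comes down to tagging tokens with sequence numbers and holding back early arrivals. A compact sketch of that reorder step (our own illustration, not the thesis code):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical reorder buffer for a hand-rolled serial_in_order stage:
// tokens carry a sequence number; a token is released only when every
// earlier token has been released, so output order matches input order.
class Reorderer {
    std::map<long, int> pending_;   // seq -> token payload
    long next_ = 0;                 // next sequence number to release
public:
    // Accept one token; return all tokens that become releasable.
    std::vector<int> accept(long seq, int payload) {
        pending_[seq] = payload;
        std::vector<int> out;
        while (!pending_.empty() && pending_.begin()->first == next_) {
            out.push_back(pending_.begin()->second);
            pending_.erase(pending_.begin());
            ++next_;
        }
        return out;
    }
};
```

In threading building blocks this bookkeeping is done internally by the serial_in_order filter mode; the pthread version has to add the sequence field to its token structure and guard this buffer with a mutex.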

4. the stages were run in parallel with different number of threads.10.9: Performance of Filter bank application(TBB) for different modes of operation of the filters. Thereby introducing both task and data parallelism in the pipeline design helps in proper load balancing and achieve good performance. Being ignorant of the order of tokens made the stages run serially out of order execution of tokens. Significant performance benefits were easily obtained when the application was developed in the threading building blocks.2 Parallel Filters The overall performance of a pipeline application is determined by the slowest sequential stage in the pipeline.9. This can be seen from Figure 5. The results obtained here are very significant because you can see the pthread version of the application out performing the threading building blocks version of the . Results obtained are as in Figure 5. 5.Chapter 5. Making the slow bottleneck stages run in parallel(Data Parallel) increases the throughput of the pipeline. Figure 5. After identification of the bottleneck stages. The Filter Bank application having stages with data parallelism was implemented in pthread to understand how the performance of the pthread application would increase by running the bottleneck stages in data parallel and to know if it gives better results than the threading building blocks library. This performance benefits was easily obtained wherever stages could be run with data parallelism and where the order of tokens processed was not important. Evaluation and Results 46 building blocks library. If all the stages in the pipeline are parallel then its a different scenario where performance depends on other factors.

This performance was not obtained by the Threading Building Blocks library with the same thread count, even when all the stages were run in parallel. It should also be noted that only two bottleneck stages were made data parallel in the pthread application, and it still obtained better results than the Threading Building Blocks version.

Figure 5.10: Performance of the Filter Bank application with stages running with data parallelism.

5.5 Feature 2: Setting the number of threads to run the application

The Threading Building Blocks scheduler gives the programmer the option of initialising it with the number of threads he/she wants to run the application with. If the programmer does not have a thread count in mind, the library can be left to decide how many threads are needed to run the application. This feature was tested across all three applications. The initial test was to vary the number of threads the scheduler was initialised with and observe how each application performed; the experiment was carried out for different values of the maximum number of tokens in flight.

The Bitonic sorting network degraded in performance as the number of threads was increased, irrespective of the limit on the number of tokens in flight. The results obtained are shown in Figure 5.11. To a certain extent this confirms our earlier assumption that the pipeline stages of the Bitonic sorting network perform less computation-intensive tasks.

Figure 5.11: Performance of the Bitonic Sorting Network varying the number of threads in execution.

In the Fast Fourier Transform kernel it can be seen from Figure 5.12 that increasing the number of threads gave better speedup: given more threads to work with and more tokens to work on, performance kept increasing, and it stabilised once a particular thread count was reached.

Figure 5.12: Performance of the Fast Fourier Transform kernel varying the number of threads in execution.

In the Filter Bank application the speedup increased proportionally to the number of threads, as can be seen in Figure 5.13, and again stabilised after a particular thread count was reached. From all three examples it was easy to identify the number of threads giving the best performance for each application.

Figure 5.13: Performance of the Filter Bank application varying the number of threads in execution.

This feature gave the programmer a very easy and powerful way to tune his/her application for good performance. It is clear from this evaluation that if a Threading Building Blocks application is being developed for a particular machine with a fixed number of cores, it is best to trial-run the application with different thread counts and choose the most suitable value. On the 16-core shared-memory machine on which these tests were run, every application except the Bitonic sorting network gave good performance when the number of threads was 16, as can be seen from the graphs. This supports the claim by the developers of Threading Building Blocks that good performance is achieved with one thread per processor of the machine.

When designing applications for heterogeneous machines with different numbers of cores, it is instead ideal to leave the initialisation of the number of threads to the Threading Building Blocks library, which chooses it based on the machine it is running on. Threading Building Blocks supports this automatic initialisation of the scheduler, and it makes the application scale to machines with different numbers of cores; this is very powerful because the programmer never needs to change the code for the machine he/she is trying to run the application on. On experimenting, it was found that the automatically initialised thread count equals the number of processors in the machine. The experiment was run with different input sizes and on machines with different numbers of cores; the results, shown in Figure 5.14 and Figure 5.15, indicate that the automatic initialisation works well in most cases, although there were cases where other thread counts gave better results.

Figure 5.14: Performance of the FFT kernel for different input sizes and numbers of cores of the machine.

Figure 5.15: Performance of the Filter Bank application for different input sizes and numbers of cores of the machine.

5.6 Feature 3: Setting an upper limit on the number of tokens in flight

The number of tokens in flight defines the amount of parallelism in the pipeline. When the maximum number of tokens in flight is limited to N, no more than N operations can be happening in the pipeline at any instant of time.

Setting the right value for the number of tokens in flight is therefore very crucial to the performance of the pipeline. A low value reduces the amount of parallelism and does not utilise the full computing power the hardware provides: many threads are spawned to perform the operations in the pipeline, but because there are few tokens, most of the threads are idle most of the time. A very large value can also be a problem, because it may cause excessive resource utilisation (of memory, for example). This Threading Building Blocks feature is advantageous because the programmer does not have to go to the extra trouble of considering such issues, unlike in pthread programming. Moreover, any pipeline application needs some check on the number of tokens in flight for the pipeline to function properly, and this feature was useful in many ways during application development. The best example was the Bitonic sorting network, as discussed earlier, where one of the designs for a data structure made it necessary to keep a check on the number of tokens in flight to ensure that the application worked correctly.

To further evaluate this feature, and to understand the performance benefits it easily provides to the programmer, the three applications were run with different token limits for a given thread count. The speedups of the applications were calculated and plotted on graphs, as seen in Figures 5.16 to 5.18. The Bitonic sorting network is computationally less intensive, so increasing the limit on the number of tokens in flight did not make much difference to its speedup, as Figure 5.16 shows.

Figure 5.16: Performance of the Bitonic Sorting Network varying the limit on the number of tokens in flight.

Figure 5.17: Performance of the Fast Fourier Transform varying the limit on the number of tokens in flight.

In the Fast Fourier Transform kernel application it can be seen from Figure 5.17 that the performance of the application increased with the limit on the number of tokens in flight. For large numbers of threads the speedup stabilised once the limit reached the value 10, which we understand to be the optimum value for the number of tokens in flight. Something similar can be observed for the Filter Bank application in Figure 5.18: performance increased proportionally to the number of tokens and stabilised after a particular limit was reached.

Figure 5.18: Performance of the Filter Bank application varying the limit on the number of tokens in flight.

This feature was then implemented in the corresponding pthread application to see whether it gave performance results similar to those of Threading Building Blocks.

Figure 5.19: Performance of the pthread applications varying the limit on the number of tokens in flight.

The results were obtained as in Figure 5.19. The graph levels out after a steep rise in speedup. This is probably because the stages here are run serially, so the number of threads executing the application is constant; to keep improving, the number of threads processing tokens would have to increase proportionally to the number of tokens in flight, ensuring that there are always enough workers to process the tokens, and that is what a programmer trying to achieve the best performance for his/her application would have to arrange.

5.7 Feature 4: Nested parallelism

The Threading Building Blocks library supports nested parallelism, so it was decided to evaluate this feature to understand how it could help pipeline application development. The for loops performing computation on the large array and the filter-coefficient matrices in the Filter Bank application were made to run in parallel using the parallel_for construct provided by the library. This gave a small performance benefit over the result obtained earlier in the pthread application with all the stages of the filter running in parallel mode, which had a speedup of 9.3.

Chapter 6

Guide to the Programmers

6.1 Overview

There are many parallel programming environments available now for developing parallel applications, and choosing from such a variety of languages/libraries is not an easy task. It is a very crucial decision for any programmer to choose the programming language/library best suited to their parallel application development. This guide provides a helping hand to future programmers developing shared-memory pipeline applications, to help them make a choice between conventional POSIX thread programming and Intel Threading Building Blocks. The guide will help programmers realise their priorities and make the right choice between the two parallel programming libraries, depending on their experience in parallel programming, the design requirements of their application, their performance and scalability requirements, and the development time available for their application.

6.2 Experience of the programmer

Novice programmer: Threading Building Blocks is definitely the best choice. It abstracts all the low-level threading details and guides an inexperienced programmer towards developing efficient, scalable and reliable applications.

Expert programmer: Threading Building Blocks definitely guides the programmer towards better applications, but it is not a perfect solution to all application needs. It does not provide the flexibility to alter low-level details as pthreads does.

6.3 Design of the application

Number of stages in the pipeline

TBB: If the application needs to decide the number of stages in the pipeline dynamically at runtime, it cannot be implemented in Threading Building Blocks, as the library does not support this.

POSIX Thread: Runtime determination of the number of stages in the pipeline is possible in pthread programming.

Non-linear pipeline pattern

TBB: If the application design strictly requires the pipeline to be non-linear, Threading Building Blocks is not a good choice, as it does not support non-linear pipelines. But if the design allows the pipeline to be restructured into a linear pattern at the cost of a small increase in latency, then Threading Building Blocks can still be used for the pipeline application development.

POSIX Thread: The programmer is given the flexibility to implement a pipeline of any pattern the design requires.

Real-time applications

TBB: Real-time applications cannot be implemented in the Threading Building Blocks library because of its unfair and non-pre-emptive task scheduler.

POSIX Thread: The library is good for real-time operations because of its deterministic scheduling policy.

I/O-bound operations

TBB: If the application to be developed has I/O-bound operations, Threading Building Blocks is not the right choice, again because of its unfair and non-pre-emptive task scheduler.

POSIX Thread: The library is good for I/O-bound operations because of its deterministic nature.

Stages with data parallelism

TBB: It is very easy to make stages run with data parallelism by setting the mode of operation of the filters accordingly.

POSIX Thread: Data parallelism in the stages is not easily achieved; it is left to the programmer to implement the feature if needed.

Serial and ordered processing of tokens

TBB: If the design requires it, it is very easy to make all the stages run serially and to ensure ordered processing of tokens by setting the mode of operation of the filters accordingly.

POSIX Thread: Serial and ordered processing of tokens in all the stages is not easily achieved. The programmer has to handle all the thread synchronisation issues and ensure exclusive access to shared resources.

Nested parallelism

TBB: The operations performed on a token within a stage can be broken down and made to run in parallel with the help of the parallel constructs the library provides, such as parallel_for and parallel_while.

POSIX Thread: Nested parallelism is difficult to incorporate.

Number of tokens in flight

TBB: If the application design requires a check on the number of tokens in flight, Threading Building Blocks is a great help, as it provides a feature for setting an upper limit on the number of tokens in flight.

POSIX Thread: The programmer has to implement the feature if needed.

6.4 Performance

TBB: Good performance is easily obtained using the Threading Building Blocks library because it efficiently abstracts all the threading details that are needed to achieve good performance.

It is also achieved thanks to the work-stealing scheduling policy of the task scheduler. Data parallelism and nested parallelism, both of which improve performance, are easily implemented in the pipeline using Threading Building Blocks.

POSIX Thread: Good performance is not easily achieved. It requires the programmer to know all the optimisation strategies needed to improve the performance of the application.

6.5 Scalability

TBB: Thanks to the automatic scheduler initialisation functionality, the number of threads used to run the application is decided by the TBB library according to the number of processors in the machine on which the application is run. This makes the application scalable without the need to change anything in the code.

POSIX Thread: Automatic scaling depending on the machine is not present in the library.

6.6 Load balancing

TBB: It is easily possible to collapse fast-running stages and to make slow bottleneck stages run with data parallelism, ensuring a load-balanced pipeline.

POSIX Thread: This is not easily achieved. The programmer has to write code to implement the feature, which is a difficult task.

6.7 Application development time

TBB: If the need is fast development of efficient pipeline applications, Threading Building Blocks will help you achieve this because of the abstractions it provides.

POSIX Thread: Development time is higher than with Threading Building Blocks, and plenty of time is spent testing and debugging the application.

Chapter 7

Future Work and Conclusions

7.1 Overview

In this project we tried to evaluate Intel Threading Building Blocks pipelines. There has not been much work done on evaluating parallel programming languages and libraries, but now that the world is moving towards parallel programming, evaluations of these languages/libraries can help parallel programmers to a large extent in their decision making. The commercial and academic communities look forward to such evaluations because there are so many options to choose from, and deciding on the best one is really difficult. The evaluation of the Threading Building Blocks library was done in comparison to POSIX threads. Various features that the library provides were put under test and evaluated in terms of usability, expressibility and performance. The features Threading Building Blocks provides were also implemented with the pthread library, to understand how the abstraction provided by the Threading Building Blocks library makes the job of the programmer easier. Implementing various pipeline applications also helped in properly evaluating the library, giving a much deeper analysis of the library and of how useful it can be during pipeline application development.

7.2 Conclusions

In general, the project was a success. We were able to evaluate Threading Building Blocks pipelines to a large extent, and we were also able to carry out a comparative study with the POSIX thread library which brought out the pros and cons of each library. We went through the entire life cycle of software development for different pipeline applications in both programming libraries.

The designing phase was much more difficult for the pthread applications because of the amount of detail the programmer had to have knowledge of, and because of the difficulty of getting a design that correctly incorporated all the low-level details. In the designing phase we found out about the limitation of the Threading Building Blocks library in expressing non-linear pipelines, but on the other hand it provided many features, such as setting the limit on the number of tokens in flight, setting the number of threads used to run the application, and setting the mode of operation of the stages, which reduced the amount of design detail the programmer had to handle.

During the implementation phase it was found that Threading Building Blocks was much more usable for the programmer because of the abstractions the library provided. As the library provided most of the abstraction, there was much less scope for programmer error, which made debugging and testing the applications much easier. Pthreads, on the other hand, was prone to many more errors because of the details the programmer had to handle, and took much longer development time than the Threading Building Blocks applications.

During the performance analysis of the applications developed in the two programming libraries, we learned about both in great detail. The speedups obtained for the applications developed in Threading Building Blocks were much higher than for the pthread versions; but on further optimising the pthread applications, for example by introducing data-parallel stages, which was not an easy task, we were able to obtain speedups better than those of Threading Building Blocks. With the nested parallelism feature that Threading Building Blocks supports, together with its other parallel constructs such as parallel_for and parallel_while, the performance of the Threading Building Blocks applications then overtook that of the pthread applications again. Even where Threading Building Blocks and pthreads achieved similar performance, it was achieved far more easily with Threading Building Blocks. From the performance results it was also found that Threading Building Blocks is not ideal for I/O-bound tasks or real-time applications.

It is very necessary to understand that pthreads, though difficult to program, gives the programmer the flexibility to manipulate the lowest-level details. With such deep access, a programmer can easily optimise programs in application-specific ways, and may obtain optimisations far better than what the Threading Building Blocks abstractions provide, whereas Threading Building Blocks gives only a few options for optimising programs, which may not give the best possible results. In the end we cannot conclude that one library is better than the other.

Each library has its pros and cons for pipeline application development, depending on the experience of the programmer, the design of the application, the required development time of the application, and so on. It is therefore up to the programmer to decide, from the evaluation we have done here and from his/her own needs, whether Threading Building Blocks is best suited to their pipeline application development.

7.3 Future Work

We have evaluated only the pipelines in the Threading Building Blocks library; the library provides many other features and parallel constructs that could be put under test and analysed. The combination of all these parallel constructs may prove really helpful in developing efficient parallel applications. In this project we have evaluated only the high-level details of the Threading Building Blocks library, so further work could evaluate the library in much greater depth, taking into consideration how it handles low-level details and how the scheduler works. This is definitely an area where further detailed low-level evaluation is possible.

The inability of Threading Building Blocks to handle I/O-bound tasks is a serious issue. Since Threading Building Blocks can work in combination with the pthread library, a detailed study could analyse whether combining the two libraries, letting pthreads handle the I/O operations, gives better results. Intel TBB's pipeline can now perform DirectX, OpenGL and I/O parallelisation by the use of the thread-bound filter feature: in many cases there are certain types of operations that must always be issued from the same thread, and by using a filter bound to a thread you can guarantee that the final stage of the pipeline will always use the same thread [14].

Bibliography

[1] StreamIt benchmark suite. http://groups.csail.mit.edu/cag/streamit/shtml/benchmarks.shtml.

[2] Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.

[3] UC Berkeley. www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html.

[4] Erik Barry Erhardt and Daniel Roland Llamocca Obregon. Decimation in time fast Fourier transform. Technical report, November 2006. Sieve of Eratosthenes – pthreads implementation: http://statacumen.com/pub/proj/homework/parallel/primes_pthread.c.

[5] G. A. Geist and V. S. Sunderam. Network-based concurrent computing on the PVM system. Concurrency: Practice and Experience, 4(4):293–311, 1992.

[6] Robert B. Grady. Successfully applying software metrics. Computer, 27(9):18–25, 1994.

[7] Robert Reed (Intel). Overlapping IO and processing in a pipeline. http://software.intel.com/en-us/blogs/2007/08/23/overlapping-io-and-processing-in-a-pipeline/, 2007.

[8] Simon Peyton Jones. Beautiful concurrency. Technical report, Microsoft Research, Cambridge, 2007.

[9] T. A. Marsland, T. Breitkreutz, and S. Sutphen. A network multi-processor for experiments in parallelism. Concurrency: Practice and Experience, 3(3):203–219, 1991.

[10] Thomas J. McCabe. A complexity measure. In ICSE '76: Proceedings of the 2nd International Conference on Software Engineering, page 407, 1976. IEEE Computer Society Press.

[11] Dave McCracken. POSIX threads and the Linux kernel. Technical report, IBM Linux Technology Centre, 2002.

[12] Angeles Navarro, Rafael Asenjo, Siham Tabik, and Calin Cascaval. Analytical modeling of pipeline parallelism. In PACT '09: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, pages 281–290, Washington, DC, USA, 2009. IEEE Computer Society.

[13] Intel Software Network. Intel Threading Building Blocks, OpenMP, or native threads? http://software.intel.com/en-us/articles/intel-threading-building-blocks-openmp-or-native-threads/, July 2009.

[14] Intel Software Network. Version 2.2: Intel Threading Building Blocks, worth a look. http://software.intel.com/en-us/blogs/2009/08/04/version-22-intel-threading-building-blocks-worth-a-look/, August 2009.

[15] Manohar Rao, Zary Segall, and Dalibor Vrsalovic. Implementation machine paradigm for parallel programming. In Supercomputing '90: Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pages 594–603, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press.

[16] Jonathan Schaeffer, Duane Szafron, Greg Lobe, and Ian Parsons. The Enterprise model for developing distributed applications. Technical report, Department of Computer Science, University of Alberta, Edmonton, Alberta, Canada T6G 2H1, 1993.

[17] Duane Szafron and Jonathan Schaeffer. An experiment to measure the usability of parallel programming systems. Technical report, Department of Computer Science, University of Alberta, 1994.

[18] Steven P. VanderWiel, Daphna Nathanson, and David J. Lilja. Complexity and performance in parallel programming languages. In HIPS '97: Proceedings of the 1997 Workshop on High-Level Programming Models and Supportive Environments (HIPS '97), page 3, Washington, DC, USA, 1997. IEEE Computer Society.

[19] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In Proc. Intl. Conf. on Compiler Construction (CC), pages 179–196, Grenoble, France, 2002.

[20] Wikipedia. Sieve of Eratosthenes. http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.

[21] Wikipedia. Parallel programming languages. http://en.wikipedia.org/wiki/Parallel_computing#Parallel_programming_languages.

[22] Gregory V. Wilson, Jonathan Schaeffer, and Duane Szafron. Enterprise in context: assessing the usability of parallel programming environments. In CASCON '93: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research, pages 999–1010, 1993. IBM Press.