
PERFORMANCE ANALYSIS OF PARALLELISED APPLICATIONS ON MULTICORE PROCESSORS

Vishnu.G (CB107CS070), Baskar.R (CB107CS110), Avinash Kumar.M.K (CB107CS209)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, ETTIMADAI, COIMBATORE-641105
ggvishnu29@gmail.com

ABSTRACT: This project is aimed at analysing and comparing the performance of selected applications that are parallelised and run on a multicore platform. Tools such as the Vtune performance analyser, the Intel C++ compiler and the OpenMP concurrency platform are used to collect thread data and program execution times. Using the data collected from these tools, a graph is drawn showing the speed-up attained when the applications are parallelised.

Introduction: The digital age has paved the way for undeterred growth in all fields of science, especially in the field of microprocessors, where the number of transistors on a chip has grown exponentially. This growth followed Moore's law, which proposed that the number of transistors on Intel's chips would double roughly every eighteen months; the transistor count per chip reached millions, and the clock frequency of operation rose from a few megahertz to a few gigahertz. Later it became evident that the heat generated was so intense that any further over-clocking of the microprocessor would risk the chip failing under the heat generated during its operation. The frequency of operation could not be increased further without excessive heat generation, and this paved the way for multicore processors to come to the fore.

A multicore processor consists of multiple cores, or multiple CPUs, on one chip. It is quite different from a multiprocessor system, in which separate processors work together and each processor has its own dedicated memory, cache and other hardware; in a multicore processor, the multiple execution cores are placed on the same chip and resources such as cache and memory are common to all the execution cores. When we run two applications simultaneously on a CPU, what actually happens is that the CPU time is shared between the two applications, with each time slice in the order of microseconds. In a multicore system with n cores, it is feasible to run n applications simultaneously.

This paper is aimed at envisaging the need for parallelising present-day software for better adaptation to multicore architectures. Our project mainly deals with analysing the performance of applications that are parallelised using a concurrency platform. We used various tools provided by Intel for testing and analysing the applications. They are as follows:
1. Intel C++ compiler
2. OpenMP concurrency platform
3. Vtune performance analyzer

Intel C++ compiler: This is a C++ compiler with support for all OpenMP directives (pragmas).

OpenMP concurrency platform: OpenMP stands for Open Multi-Processing. It is an Application Program Interface (API) that may be used to explicitly direct multithreaded, shared-memory parallelism. It comprises three primary API components (a minimal sketch combining them follows the list of applications below):
1. Compiler directives
2. Runtime library routines
3. Environment variables

Vtune performance analyser: This tool can be used to analyse the performance of any application in detail. The call graph wizard is used for performance analysis, and the sampling runs can be tuned by selecting the advanced performance analysis wizard.

PROGRAMS AND ANALYSIS: We selected applications that can be parallelised without any hazards or data dependencies. We first wrote a serial version of each application and measured its execution time. We then parallelised the application using OpenMP directives, measured the execution time again, and observed a significant reduction in execution time. The following applications were analysed:
1. Matrix multiplication
2. Merge sort
3. Odd-even sort
4. Database search
5. Image conversion
6. Path finding
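As a minimal illustration of how the three OpenMP components fit together (not taken from the project code), the short program below uses a parallel compiler directive and the omp_set_num_threads()/omp_get_thread_num() runtime routines; the thread count of 4 is an arbitrary choice and could equally be supplied from outside the program through the OMP_NUM_THREADS environment variable.

    // Illustrative only: the three OpenMP API components in one tiny program.
    #include <iostream>
    #include <omp.h>

    int main()
    {
        // Runtime library routine; OMP_NUM_THREADS is the environment-variable
        // alternative for setting the same value without recompiling.
        omp_set_num_threads(4);

        #pragma omp parallel            // compiler directive
        {
            #pragma omp critical        // serialise access to the output stream
            std::cout << "thread " << omp_get_thread_num()
                      << " of " << omp_get_num_threads() << std::endl;
        }
        return 0;
    }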

Matrix multiplication: This application involves summing the products of elements from two matrices (A and B) and storing the result in the product matrix (C). Since the operations for each element of the product matrix are independent of one another, they can be parallelised easily, so we parallelised the application by including OpenMP directives. We show the parallel version of the code and the difference in runtime measured with the Vtune analyzer. In this application, large numbers of pairs of matrices with predefined values are multiplied. Two parallel threads executing simultaneously, one on each core, are considered while generating the call graph.

Parallelization of the outer loop:

Matrix_mul()
{
    int numprocs = omp_get_num_procs();
    omp_set_num_threads(numprocs);
    double start = omp_get_wtime();
    for (x = 0; x < n; x++)
    {
        // Pragma directive on the outer loop of each multiplication
        #pragma omp parallel for shared(a,b,c) private(i,j,k)
        for (i = 0; i < 100; i++)
        {
            for (j = 0; j < 100; j++)
            {
                c[i][j] = 0;
                for (k = 0; k < 100; k++)
                {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
    double end = omp_get_wtime();
    cout << "end:" << endl;
    cout << "run time:" << (end - start);
}

Parallelization of the inner-most loop:

Matrix_mul()
{
    double start = omp_get_wtime();
    for (x = 0; x < n; x++)
    {
        for (i = 0; i < 100; i++)
        {
            for (j = 0; j < 100; j++)
            {
                c[i][j] = 0;
                // Pragma directive on the inner-most loop
                #pragma omp parallel for shared(a,b,c) private(i,j,k)
                for (k = 0; k < 100; k++)
                {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }
    double end = omp_get_wtime();
    cout << "end:" << endl;
    cout << "run time:" << (end - start);
}

There are two issues which we found to be important while parallelising the application:

1. The number of threads created should be proportional to the number of processors. We included the OpenMP runtime calls to obtain the number of cores available in the processor and set the number of threads to that value before parallelising the application.

2. The function that each thread executes should be complex enough to show a performance gain. In the matrix multiplication program above, we parallelised the whole multiplication of the two matrices rather than parallelising the operations for each element of the product matrix. To illustrate this, we placed the pragma directive on the inner-most for loop and analysed the application; we found that it takes more execution time than the serial version.

Analysis of Results: The above application was run with the data sets mentioned in the table below and the following results were obtained.

Datasets (pairs of matrices multiplied):   10000    40000    80000
Runtime (Serial)*:                         11.86    47.30    94.57
Runtime (Parallel)*:                        6.77    24.46    48.86
Speed Up:                                   1.75     1.93     1.94

* Note: all run-time data represented in seconds.
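The excerpts above omit the surrounding declarations. The following is a minimal, self-contained sketch of the outer-loop version; the matrix dimension (100), the number of matrix pairs (10000) and the fill values are our own assumptions for illustration, not the project's exact test harness.

    #include <iostream>
    #include <vector>
    #include <omp.h>

    int main()
    {
        const int N = 100;        // matrix dimension (assumption)
        const int PAIRS = 10000;  // number of matrix pairs multiplied (assumption)

        std::vector<std::vector<double>> a(N, std::vector<double>(N, 1.0));
        std::vector<std::vector<double>> b(N, std::vector<double>(N, 2.0));
        std::vector<std::vector<double>> c(N, std::vector<double>(N, 0.0));

        // Issue 1: match the number of threads to the number of cores.
        omp_set_num_threads(omp_get_num_procs());

        double start = omp_get_wtime();
        for (int x = 0; x < PAIRS; x++)
        {
            // Issue 2: parallelise the coarse-grained outer loop, so each
            // thread computes whole rows of the product matrix.
            #pragma omp parallel for shared(a, b, c)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
        }
        double end = omp_get_wtime();
        std::cout << "c[0][0] = " << c[0][0]
                  << ", run time: " << (end - start) << " s" << std::endl;
        return 0;
    }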

The average speed-up thus obtained was 1.8717. From the above table it was observed that, as the volume of data being processed increases, parallelism becomes increasingly important for good program execution time.

Call Graph function for serial code for Matrix multiplication: As seen in the screenshot from the Vtune performance analyzer for a sequential execution (figure 1a in the diagrams section at the end of this paper), there are two threads, namely the master thread (Thread_0) and the child thread (Thread_1).

Call Graph function for Parallelized code for Matrix multiplication: The screenshot (figure 1b in the diagrams section at the end of this paper) shows the threads spawned for a parallel execution of the application. In this program, we included the pragma directive inside the outer for loop. Here we set the thread count by calling omp_get_num_procs() and omp_set_num_threads(): the first function returns the number of processors available in the system, and the second sets the number of threads to that value. Each matrix-pair (A and B) multiplication is parallelised. Since the number of cores available in this case is 2, two child threads are created.

Time taken for parallel version:
Thread_0: Execution time: 6.326343 s
Thread_1: Execution time: 6.714668 s
Thread_2: Execution time: 6.126622 s

Merge Sort: The second application we selected to parallelise is sorting. Merge sort involves dividing a list into two and sorting the two parts recursively (a divide-and-conquer approach). Here we first divided the list into two, called the merge sort function (a recursive function) on each half, and then merged the two lists into one. As the list is divided into two, threads are created and mapped onto the available cores, as shown in the call graph. We show the parallel version of the code and the difference in runtime using the Vtune analyzer.

// parallel version
Merge_sort()
{
    int numprocs = omp_get_num_procs();
    omp_set_num_threads(numprocs);
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            mergesort(1, n/2);
            #pragma omp section
            mergesort((n/2)+1, n);
        }
    }
    merge(1, (n/2), n);
    double end = omp_get_wtime();
    cout << "run time:" << (end - start) << endl;
}
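The excerpt above omits the recursive mergesort() and merge() routines. Below is a minimal, self-contained sketch of the same two-section idea; for brevity it uses std::sort on each half and std::inplace_merge for the final merge, and the input size and fill values are our own assumptions.

    #include <algorithm>
    #include <iostream>
    #include <vector>
    #include <omp.h>

    int main()
    {
        const std::size_t n = 1 << 24;                  // assumed input size
        std::vector<int> v(n);
        for (std::size_t i = 0; i < n; i++)             // deterministic pseudo-random fill
            v[i] = static_cast<int>((i * 2654435761u) % 1000003u);

        omp_set_num_threads(omp_get_num_procs());
        double start = omp_get_wtime();

        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                std::sort(v.begin(), v.begin() + n / 2);   // one core sorts the first half
                #pragma omp section
                std::sort(v.begin() + n / 2, v.end());     // the other core sorts the second half
            }
        }
        std::inplace_merge(v.begin(), v.begin() + n / 2, v.end());   // final merge, done serially

        double end = omp_get_wtime();
        std::cout << "run time: " << (end - start) << " s, sorted: "
                  << std::is_sorted(v.begin(), v.end()) << std::endl;
        return 0;
    }

As in the project code, only the two sorting calls run in parallel; the final merge is a single serial pass, which is one reason the observed speed-up stays below 2 on two cores.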

Analysis of Results:

Number of unsorted elements:   100000000 (100 million)   200000000 (200 million)   300000000 (300 million)
Runtime (Serial)*:             15.39                     31.85                     48.88
Runtime (Parallel)*:           10.11                     21.80                     32.23

* All run-time data represented in seconds.

Average Speed Up: 1.498926. (Refer to screenshots 2a and 2b at the end of the paper.)

Time taken for completion of threads and waiting time:

Time taken for serial version:
Thread_0: Execution time: 15.894525 s, wait time: 15.892648 s
Thread_1: Execution time: 15.838973 s, wait time: 0.250066 s

In this program, we divided the list into two and parallelised that section using the OpenMP sections pragma. Two threads are created and each thread sorts one half of the list; at the end, the two lists are merged into a single list. The following thread-specific data shows the time each thread took and the time it spent waiting.

Time taken for parallel version:
Thread_0: Execution time: 10.586861 s, wait time: 10.581956 s
Thread_1: Execution time: 10.563912 s, wait time: 0.537342 s
Thread_2: Execution time: 10.057626 s, wait time: 10.057574 s

Odd-Even Sort: Description: In many areas of application, dealing with a large pool of numbers becomes unavoidable (for example, a company preparing an expense chart has to list the expenditure for every single item in some order). There are only two kinds of numbers, odd and even; and now that we have a multicore processor, we can run a thread on one core to sort out the odd numbers and another thread on the other core to sort the even numbers in parallel.

// parallel version
Odd_even_sort()
{
    start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            oesort(1, n/2);
            #pragma omp section
            oesort((n/2)+1, n);
        }
    }
    merge(1, (n/2), n);
    end = omp_get_wtime();
    cout << "Time:" << (end - start);
}
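The oesort() routine is not shown in the paper. In the hedged sketch below we assume it is an odd-even transposition sort over one half of the array, and we keep the same split-sort-merge structure as the excerpt above; the input size and fill values are assumptions for illustration.

    #include <algorithm>
    #include <iostream>
    #include <vector>
    #include <omp.h>

    // Assumed behaviour of oesort(): odd-even transposition sort of v[lo, hi).
    static void oesort(std::vector<int>& v, std::size_t lo, std::size_t hi)
    {
        bool sorted = false;
        while (!sorted)
        {
            sorted = true;
            for (std::size_t i = lo + 1; i + 1 < hi; i += 2)   // odd phase
                if (v[i] > v[i + 1]) { std::swap(v[i], v[i + 1]); sorted = false; }
            for (std::size_t i = lo; i + 1 < hi; i += 2)       // even phase
                if (v[i] > v[i + 1]) { std::swap(v[i], v[i + 1]); sorted = false; }
        }
    }

    int main()
    {
        const std::size_t n = 100000;                          // assumed input size
        std::vector<int> v(n);
        for (std::size_t i = 0; i < n; i++)
            v[i] = static_cast<int>((i * 48271u) % 99991u);    // pseudo-random fill

        omp_set_num_threads(omp_get_num_procs());
        double start = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                oesort(v, 0, n / 2);                           // one thread sorts the first half
                #pragma omp section
                oesort(v, n / 2, n);                           // the other thread sorts the second half
            }
        }
        std::inplace_merge(v.begin(), v.begin() + n / 2, v.end());   // serial merge of the two halves
        double end = omp_get_wtime();
        std::cout << "run time: " << (end - start) << " s, sorted: "
                  << std::is_sorted(v.begin(), v.end()) << std::endl;
        return 0;
    }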

Analysis of Results:

Number of elements:    100000    200000    300000
Runtime (Serial)*:       5.84     23.17     52.09
Runtime (Parallel)*:     3.74     14.81     32.24

* All run-time data represented in seconds.

Average Speed Up: 1.53635. (Refer to screenshots 3a and 3b at the end of the paper.)

Time taken for serial version:
Thread_0: Execution time: 8.712375 s, wait time: 0 s
Since it is the only thread in execution, it does not have to wait for the completion of any other thread.

In this program, the list is divided into two lists. Each list is given to a thread and sorted; the two lists are then merged by calling the merge() function. The following is the amount of time the various threads took for completion and spent waiting.

Time taken for parallel version:
Thread_0: Execution time: 3.784085 s, wait time: 3.767029 s
Thread_1: Execution time: 3.848531 s, wait time: 0.806332 s
Thread_2: Execution time: 3.780029 s, wait time: 3.779433 s

Database search: This is one of the major applications that can be parallelised. Most of today's searches involve a tree data structure, and searching in a tree can be parallelised: we can divide the tree into two subtrees (by treating the two children of the root as two roots) and give each subtree to a thread. We show both the serial version and the parallel version of the code and the difference in their runtime using the Vtune analyser.

Database_search()
{
    // serial version
    start = omp_get_wtime();
    search(root->l, data);
    search(root->r, data);
    end = omp_get_wtime();
    cout << "runtime:" << (end - start) << endl;

    // parallel version
    int numprocs = omp_get_num_procs();
    omp_set_num_threads(numprocs);
    start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            search(root->l, data);
            #pragma omp section
            search(root->r, data);
        }
    }
    end = omp_get_wtime();
    cout << "run time:" << (end - start) << endl;
}
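The tree-building code and the search() routine are not shown in the paper. The sketch below is our own minimal, self-contained version of the same idea: a binary tree is built and two OpenMP sections search the left and right subtrees in parallel. The node layout, database size and looked-up key are assumptions.

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>
    #include <omp.h>

    struct Node { int key; Node* l; Node* r; };   // assumed node layout

    static Node* insert(Node* root, int key)
    {
        if (!root) return new Node{key, nullptr, nullptr};
        if (key < root->key) root->l = insert(root->l, key);
        else                 root->r = insert(root->r, key);
        return root;
    }

    // Exhaustive recursive search of a subtree (the project searches the whole
    // tree, which is what makes the two subtrees independent units of work).
    static void search(const Node* root, int key, bool& found)
    {
        if (!root) return;
        if (root->key == key) found = true;
        search(root->l, key, found);
        search(root->r, key, found);
    }

    int main()
    {
        const int entries = 40000;                 // assumed database size
        std::vector<int> keys(entries);
        std::iota(keys.begin(), keys.end(), 0);
        std::shuffle(keys.begin(), keys.end(), std::mt19937{42});

        Node* root = nullptr;
        for (int k : keys) root = insert(root, k);

        const int wanted = 12345;                  // key to look up (assumption)
        bool found_l = false, found_r = false;

        omp_set_num_threads(omp_get_num_procs());
        double start = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                search(root->l, wanted, found_l);  // one thread takes the left subtree
                #pragma omp section
                search(root->r, wanted, found_r);  // the other takes the right subtree
            }
        }
        double end = omp_get_wtime();
        std::cout << "found: " << (found_l || found_r || root->key == wanted)
                  << ", run time: " << (end - start) << " s" << std::endl;
        return 0;
    }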

Analysis of Results:

Number of entities in the database:   40000    60000    100000
Runtime (Serial)*:                     5.97    13.81     20.88
Runtime (Parallel)*:                   9.23     9.84     12.82

* All run-time data represented in seconds.

Average speed-up: 1.50. (Refer to screenshots 4a and 4b at the end of the paper.)

Time taken for serial version:
Thread_0: Execution time: 192.342031 s, wait time: 42.430577 s
Thread_1: Execution time: 45.863450 s, wait time: 34.584689 s

In this program, we divided the tree into two sections and gave each section to a thread. In one thread, the left child of the root is taken as the root and searched; in the other thread, the right child of the root is taken as the root and searched.

Time taken for parallel version:
Thread_0: Execution time: 56.537615 s, wait time: 21.116431 s
Thread_1: Execution time: 21.977969 s, wait time: 21.761701 s
Thread_2: Execution time: 22.785477 s, wait time: 17.858027 s

Image Conversion (RGB to greyscale): Image processing is a computation-intensive area. Large numbers of images are sent and received over the internet every day, and on low-bandwidth connections an image can take quite some time to load, especially text images, which need not be colourful to be understood. In such cases we can convert the original image into a greyscale image or a binary image. An image is a two-dimensional array of pixels, each holding a colour value (here 0-255, a standard 8-bit image). We took the average contrast of the image and then adjusted the contrast so that pixels with a value lower than the average are made 0 and those with a higher value are made 255 (1 in binary). Since this operation is a set of independent computations, we can parallelise the execution by dividing the matrix and giving each thread a part of it to work on. Due to size constraints we assumed a maximum number of pixels per image, in this case 90601 pixels; higher-resolution images are built up from these sub-images.

C++ code:

// parallel version
int num = omp_get_num_procs();
omp_set_num_threads(num);
double start = omp_get_wtime();
for (k = 1; k <= n; k++)
{
    #pragma omp parallel for shared(picture) private(i,j)
    for (i = 1; i <= 300; i++)
    {
        for (j = 1; j <= 300; j++)
        {
            if (picture[i][j] >= 128)
                picture[i][j] = 255;
            else
                picture[i][j] = 0;
        }
    }
}
double end = omp_get_wtime();
cout << "run time:" << (end - start) << endl;
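For completeness, the following is a self-contained sketch of the same thresholding loop with the declarations filled in; the synthetic pixel values and the number of sub-images are our own assumptions, and a fixed threshold of 128 is used as in the excerpt above.

    #include <iostream>
    #include <vector>
    #include <omp.h>

    int main()
    {
        const int SIZE = 300;          // sub-image dimension, as in the paper
        const int SUBIMAGES = 10000;   // assumed number of sub-images processed

        // One grey-scale sub-image, reused for every iteration in this sketch.
        std::vector<std::vector<int>> picture(SIZE, std::vector<int>(SIZE));
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                picture[i][j] = (i * j) % 256;   // synthetic pixel values

        omp_set_num_threads(omp_get_num_procs());
        double start = omp_get_wtime();
        for (int k = 0; k < SUBIMAGES; k++)
        {
            // Rows are independent, so each thread thresholds a band of rows.
            #pragma omp parallel for shared(picture)
            for (int i = 0; i < SIZE; i++)
                for (int j = 0; j < SIZE; j++)
                    picture[i][j] = (picture[i][j] >= 128) ? 255 : 0;
        }
        double end = omp_get_wtime();
        std::cout << "run time: " << (end - start) << " s" << std::endl;
        return 0;
    }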

Analysis of Results:

Number of 300 x 300 sub-images:   10000      20000      30000
Runtime (Serial)*:                 4.7887     9.52558   14.236
Runtime (Parallel)*:               2.79798    5.56294    8.32684

* All run-time data represented in seconds.

Average Speed-Up: 1.7111. (Refer to screenshots 5a and 5b at the end of the paper.)

Time taken for serial version:
Thread_0: Execution time: 9.537091 s, wait time: 9.472819 s
Thread_1: Execution time: 9.459156 s, wait time: 0.130074 s

In this program, an image is divided into a number of sub-images and each sub-image is processed by a thread.

Time taken for parallel version:
Thread_0: Execution time: 3.214905 s, wait time: 2.631652 s
Thread_1: Execution time: 6.576604 s, wait time: 6.576558 s
Thread_2: Execution time: 3.043152 s, wait time: 2.744147 s

PATH FINDING: This program detects a path between two nodes, a starting node and an ending node. There may be many ways to reach the destination. All possible paths are searched in the graph built from the given nodal details (the graph is always placed in the first quadrant, and the co-ordinates of the starting node must be smaller than those of the ending node, otherwise the recursive procedure becomes an infinite loop). This program can be parallelised effectively using a recursive procedure: a large number of threads are created and each thread searches for a path to the destination. If one thread finds a path, the function ends and all threads quit execution. This is a simple illustration of an algorithm applied in many fields; in networking, for example, packet switching may send packets along different routes. The measured runtime may change when analysed at different times, depending on the state of the processor.

C++ code:

// Serial version
void recury(int i, int j)
{
    if (i == ei && j == ej)
    {
        cout << "found:" << endl;
        exit(1);
    }
    else
    {
        if (i+1 <= 15)
        {
            recury(i+1, j);
        }
        if (j+1 <= 15)
        {
            recury(i, j+1);
        }
    }
}

// Parallel version
void recury(int i, int j)
{
    if (i == ei && j == ej)
    {
        cout << "found:" << endl;
        exit(1);
    }
    else
    {
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                if (i+1 <= 15)
                {
                    recury(i+1, j);
                }
                #pragma omp section
                if (j+1 <= 15)
                {
                    recury(i, j+1);
                }
            }
        }
    }
}
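Note that the parallel excerpt above opens a new parallel region at every level of the recursion, which can create far more threads than cores. As an alternative (our own variant, not the project's code), the same search can be written with OpenMP tasks and a shared "found" flag; the grid bound, destination co-ordinates and the task-depth cutoff in the final clause are assumptions for illustration.

    #include <iostream>
    #include <omp.h>

    static const int LIMIT = 15;        // assumed grid bound, as in the excerpt above
    static const int ei = 15, ej = 15;  // assumed destination co-ordinates
    static int found = 0;               // set to 1 by whichever task reaches (ei, ej)

    static void recury(int i, int j)
    {
        int done;
        #pragma omp atomic read
        done = found;
        if (done) return;               // another task already found a path

        if (i == ei && j == ej)
        {
            #pragma omp atomic write
            found = 1;
            return;
        }
        if (i + 1 <= LIMIT)
        {
            // 'final' stops creating new deferred tasks deep in the recursion
            // (our own safeguard against task explosion).
            #pragma omp task firstprivate(i, j) final(i + j > 6)
            recury(i + 1, j);
        }
        if (j + 1 <= LIMIT)
        {
            #pragma omp task firstprivate(i, j) final(i + j > 6)
            recury(i, j + 1);
        }
        #pragma omp taskwait
    }

    int main()
    {
        omp_set_num_threads(omp_get_num_procs());
        double start = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp single
            recury(0, 0);
        }
        double end = omp_get_wtime();
        std::cout << (found ? "path found" : "no path")
                  << ", run time: " << (end - start) << " s" << std::endl;
        return 0;
    }

The final clause makes deep recursive calls run inline in the creating task, so the top levels of the search still run in parallel while the total number of tasks stays manageable.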

Analysis of Results:

Datasets (no. of nodes):   1000       2000      3000
Runtime (Serial)*:         0.783345   3.7046    20.2386
Runtime (Parallel)*:       0.1947     2.32      11.79564
Speed Up:                  4.02334    1.5968    1.7157

* All run-time data represented in seconds.

Average Speed-Up: 2.44528. (Here the speed-up is greater than 2 because of cache coherency.) The datasets denote the size of the graph in which the two nodal points exist; 1000, for example, refers to a 1000 x 1000 two-dimensional matrix. (Refer to screenshots 6a and 6b at the end of the paper.)

Time taken for serial version:
Thread_0: Execution time: 2.4465558 s

Time taken for parallel version:
Thread_0: Execution time: 1.214943 s, wait time: 1.188413 s
Thread_1: Execution time: 1.286208 s, wait time: 0.655480 s
Thread_2: Execution time: 0.392150 s, wait time: 0.239700 s

Graph of the achieved Speed-Up:
[Bar chart of the average speed-up (y-axis 0 to 2.5) achieved for each application A-F; see legend below.]
A - Matrix Multiplication
B - Merge Sort
C - Odd-Even Sort
D - Database search
E - Image conversion
F - Path finding

Call Graph function for serial code for Matrix multiplication(1a):

Call Graph function for Parallelized code for Matrix multiplication(1b):

Call Graph Function for serial code for Merge-sort(2a):

Call Graph Function for Parallelized code for Merge-sort(2b):

Call Graph Function for serial code for Odd-Even sort(3a):

Call Graph Function for Parallelized code for Odd-Even sort(3b):

Call Graph Function of serial code for Database Search(4a):

Call Graph Function of Parallelized code for Database Search(4b):

Call Graph function of serial code for Image Conversion(5a):

Call Graph function of parallel code for Image Conversion(5b):

Call Graph Function for serial code of Path Finding(6a):

Call Graph Function for Parallel code of Path Finding(6b):

CONCLUSION: From the analysis performed using the available tools and concurrency platforms, it is clear that parallelised applications run more efficiently on a multicore processor than their serial counterparts. A detailed view of how threads work across multiple cores has also been presented. This project can serve as a benchmark for further developments and advancements in this area. The ultimate goal of this paper is to reveal the real power of multicore processors; the day when every application produced is parallelised for multicore is not far away.


References:
[1] Cameron Hughes and Tracey Hughes. Professional Multicore Programming.
[2] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann.
[3] Robert L. Britton. MIPS Assembly Language Programming.
[4] Charles E. Leiserson and Ilya B. Mirman. How to Survive the Multicore Software Revolution.
[5] Sun Microsystems. OpenSPARC Internals.
[6] Daan Leijen and Judd Hall. Optimize Managed Code for Multi-Core Machines.
[7] Intel software documentation.