# A PROJECT REPORT

on

**Numerical Methods Implementation On CUDA**

submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology

in

**Department of Computer Engineering**

(2007-11)

Supervisor: Dr. Vijay Laxmi

Ankur Sharma (2007UCP132) Nihar Amin (2007UCP161) Praveen Khokher (2007UCP157) Shehjad Khan (2007UCP113)

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY, JAIPUR

May 2011

Contents

Acknowledgements
Certificate

1 Overview Of CUDA Programming Model
1.1 Introduction
1.2 Thread Level Hierarchy
1.3 Memory Level Hierarchy

2 Implementation Of Matrix Multiplication Algorithm On CUDA
2.1 Introduction
2.2 Matrix proves to be advantageous in the implementation of following logics
2.3 Sequential matrix-multiplication
2.4 Parallel matrix-multiplications on CUDA
2.4.1 Implementation
2.5 Kernel Specifications
2.6 Salient Features
2.7 Limitations
2.8 Observations
2.9 Conclusions

3 Implementation Of Prefix Sum Algorithm On CUDA
3.1 Introduction
3.2 Sequential Prefix-sum algorithm
3.3 Parallel Prefix-Sum On CUDA
3.3.1 Implementation
3.4 Kernel Specifications
3.5 Salient Features
3.6 Limitations
3.7 Observations
3.8 Conclusions

4 Implementation Of Bitonic Sort Algorithm On CUDA
4.1 Introduction
4.2 Parallel Bitonic-Sort On CUDA
4.3 Salient Features
4.4 Limitations
4.5 Observations
4.6 Conclusions

5 Implementation of Odd Even transposition Sort
5.1 Introduction
5.2 The odd even merge sort is advantageous as it can
5.3 Sequential Odd-Even Merge Sort
5.4 Parallel Odd Even Transposition Sort
5.4.1 Implementation
5.5 Kernel Specification
5.6 Salient Features
5.7 Limitations
5.8 Observations
5.9 Conclusions

6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA
6.1 Introduction
6.2 Sequential Quicksort
6.3 Parallel Quicksort Using Regular Sampling
6.3.1 Implementation
6.4 Kernel Specifications
6.5 Salient features
6.6 Limitations
6.7 Observations
6.8 Conclusions

7 Implementation of matrix transpose algorithm on CUDA
7.1 Introduction
7.2 Matrix transpose proves to be advantageous in the implementation of following logics
7.3 Sequential matrix transpose
7.4 Parallel matrix transpose
7.4.1 Implementation
7.5 Kernel specifications
7.6 Salient features
7.7 Limitations
7.8 Observations
7.9 Conclusions

8 Implementation of parallel sum algorithm on CUDA
8.1 Introduction
8.2 Parallel-sum proves to be advantageous in the implementation of following logics
8.3 Sequential Parallel-Sum Algorithm
8.4 Parallel Prefix-Sum
8.4.1 Implementation
8.5 Kernel Specification
8.6 Salient Features
8.7 Limitations
8.8 Observations
8.9 Conclusions

9 Calculation Of Variance and Standard Deviations on CUDA
9.1 Introduction
9.2 Finding VARIANCE AND DEVIATION proves to be advantageous
9.3 Sequentially Calculate Variance and SD
9.4 Parallely Calculate Variance and SD
9.4.1 Implementation
9.5 Kernel Specification
9.6 Limitations
9.7 Observations
9.8 Conclusions

10 Data of Algorithms

List of Figures

1.1 Thread Level Hierarchy
1.2 Memory Level Hierarchy
2.1 Thread Level Hierarchy
2.2 execution time vs Input size
2.3 SpeedUp vs input size
2.4 SpeedUp vs input size
3.1 Prefix-sum algorithm
3.2 Prefix-sum algorithm
3.3 Prefix-sum algorithm
4.1 Sample Bitonic Sorting
4.2 Kernel Used in Bitonic Sorting
4.3 Execution time vs input size
4.4 slope of speedUp vs input size
4.5 speedUp vs input size
5.1 Execution time vs input size
5.2 speedUp vs input size
6.1 Sequential Quicksort algorithm
6.2 execution time vs input size
6.3 speedUp vs input size
6.4 speedUp vs input size
7.1 Execution time vs input size
7.2 speedUp vs input size
7.3 speedUp vs input size
8.1 Execution time vs input size
8.2 speedUp vs input size
8.3 speedUp vs input size
9.1 Execution time vs input size
9.2 speedUp vs input size
9.3 speedUp vs input size

List of Tables

10.1 Matrix Multiplication (time in 10^-6 s)
10.2 Bitonic Sort Algorithm (time in 10^-6 s)
10.3 Prefix Sum (time in 10^-6 s)
10.4 Odd-Even Transposition Sort (time in 10^-6 s)
10.5 Quicksort (time in 10^-6 s)
10.6 Matrix-transpose (time in 10^-6 s)
10.7 Summation Algorithm (time in 10^-6 s)
10.8 Variance and SD (time in 10^-6 s)

Acknowledgements

We wish to express our gratitude to all the people involved in the successful completion of our Final Year Major Project, especially to our project mentor Dr. Vijay Laxmi for her guidance and critical reviews. Our sincere thanks to Dr. M. S. Gaur, who was very generous in devoting his precious time, sharing his knowledge with us, and helping us out in every possible manner. We are also thankful to all of our team members, working with whom was a great experience. And finally, our deep gratitude to our family members for their unflinching emotional support during the whole period.

Ankur Sharma Nihar Amin Praveen Khokher Shehjad Khan May 2011

Certiﬁcate

This is to certify that the work contained in this report entitled "Numerical Methods Implementation On CUDA" by Ankur Sharma (2007UCP132), Nihar Amin (2007UCP161), Praveen Khokher (2007UCP157) and Shehjad Khan (2007UCP113) has been carried out under my supervision and this work has not been submitted elsewhere for a degree.

May, 2011

Dr. Vijay Laxmi Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur.

ABSTRACT

Parallel computing is the process of dividing large problems into smaller ones and executing them concurrently, so that many computations are carried out simultaneously. The main objective of devising parallel algorithms is to check whether they give faster responses than their sequential versions. The basic aim of this project is to implement numerical methods for heavy calculations on the CUDA architecture and to compare the time taken with that of the same calculations performed sequentially on the CPU. First, the CUDA architecture and the way work is mapped onto threads and blocks are studied. Algorithms that can be implemented in parallel are identified, their sequential CPU codes are written, and then their parallel implementations on the CUDA architecture are developed. Sets of data are used to study the time taken by both implementations and inferences are drawn, primarily on the basis of the complexities of the sequential algorithms and their method of implementation on CUDA. Some parallel algorithms give sufficient speedup and some are slower than the sequential versions. The reasons are discussed, conclusions are drawn, and possible optimizations are mentioned.

**Chapter 1 Overview Of CUDA Programming Model**

1.1 Introduction

Compute Unified Device Architecture (CUDA) is a parallel computing architecture developed by Nvidia that exposes an application programming interface to graphics processors. The architecture emphasizes running many threads in parallel, each relatively slowly, rather than running a single thread very fast. CUDA-specific computations are performed on the GPU (graphics processing unit). The architecture favours applications that are compute intensive rather than memory intensive. It is a scalable programming model. Programmers generally use C for CUDA to write the code that executes on the GPU. There are three levels of abstraction in CUDA that are visible to the programmer:

1. Thread level hierarchy
2. Memory level hierarchy
3. Barrier synchronization

The basic advantage of using CUDA is that the parallel fraction of a large code runs efficiently and quickly. It follows the approach of dividing a large set of input data into blocks and executing the different blocks in parallel. The main features to look out for in the parallel processing of blocks are efficient communication of data between different blocks and between the threads of the same block, and synchronization between blocks and between the threads of a block.

CUDA executes the sequential part of the code on the CPU, while the parallel portion is executed on the GPU. The GPU code is compiled by the Open64 compiler, which produces parallel thread execution (PTX) files to run on the GPU. Qualifiers are used to distinguish between the variables and functions of the CPU code and those of the GPU code. CUDA operates on a single instruction multiple data (SIMD) architecture, but threads can diverge from this on the basis of conditional operators, the block ID and the thread ID.

1.2 Thread Level Hierarchy

The thread level abstraction is shown in Figure 1.1.

Figure 1.1: Thread Level Hierarchy

The thread level abstraction on CUDA can be viewed as a grid of blocks containing threads. Each thread possesses a unique ID associated with it. A block can contain up to a maximum of 512 threads on the Quadro FX 1700 GPGPU architecture. A thread can have its unique ID in the x, y and z dimensions, i.e. threadIdx.x, threadIdx.y and threadIdx.z. Similarly, a collection of blocks is called a grid and can contain blocks in all the three dimensions. The threads within a block can communicate with each other using the shared memory visible per block and can synchronize their execution using the built-in __syncthreads() function. Synchronization between different blocks launched by the kernel cannot be done using the __syncthreads() function. Different blocks communicate with each other using the device memory, i.e. the global memory. When a kernel is launched, a grid of thread blocks gets created on the device, with each thread block containing many threads. Both fine grained and coarse grained data parallelism can be implemented in CUDA. The threads provide fine grained parallelism while the blocks provide coarse grained parallelism.
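
As a rough illustration of the hierarchy described above, the following plain C sketch emulates on the CPU how a one-dimensional grid of blocks covers an array. The names `global_index`, `blocks_needed`, `scale_by_two` and `BLOCK_SIZE` are assumptions made for illustration and are not CUDA API calls. On a device the two loops would not exist: each (block, thread) pair runs concurrently and reads its own blockIdx.x and threadIdx.x.

```c
#define BLOCK_SIZE 512  /* maximum threads per block quoted in the text */

/* Global element index handled by thread `thread_id` of block `block_id`
   (the CPU analogue of blockIdx.x * blockDim.x + threadIdx.x). */
int global_index(int block_id, int thread_id)
{
    return block_id * BLOCK_SIZE + thread_id;
}

/* Number of blocks needed to cover n elements, rounded up. */
int blocks_needed(int n)
{
    return (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Emulates a kernel launch: every (block, thread) pair doubles one element. */
void scale_by_two(int *data, int n)
{
    int blocks = blocks_needed(n);
    for (int b = 0; b < blocks; b++) {           /* coarse grained: blocks  */
        for (int t = 0; t < BLOCK_SIZE; t++) {   /* fine grained: threads   */
            int id = global_index(b, t);
            if (id < n)                          /* guard the partial block */
                data[id] *= 2;
        }
    }
}
```

The bounds check `id < n` is what allows an input size that is not a multiple of the block size: the surplus threads of the last block simply do nothing.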

1.3 Memory Level Hierarchy

The memory level abstraction is shown in Figure 1.2.

Figure 1.2: Memory Level Hierarchy

There are four different types of memory shown above: registers, shared, global and constant (not including the texture memory). The global memory can be accessed by every thread of every block and by the CPU. The registers are specific to each thread and are the fastest type of memory. The shared memory is visible to a particular block, and thus only the threads of that block can access it. Constant memory is faster than global memory but slower than registers and shared memory; however, it can only be written to in host code. Device code can read constant memory but cannot write to it. The size of global memory can scale to gigabytes, whereas constant memory is limited to 64 KB and shared memory is very limited (usually up to 16 KB).

The allocation and deallocation of global memory is done by the host. Functions like cudaMalloc() and cudaMemcpy() are used for the allocation and movement of data to or from the device. Identifiers like cudaMemcpyDeviceToHost are used to specify the direction of data transfer. The memory transfer functions can be synchronous as well as asynchronous. Synchronous means the CPU can resume its execution only after the entire data has been transferred.

**Chapter 2 Implementation Of Matrix Multiplication Algorithm On CUDA**

2.1 Introduction

Matrix multiplication has inherent parallelism, and thus by using a parallel architecture we can complete the work in less time, i.e. achieve speedup. We multiply two matrices of size M x N and N x O and get a resulting matrix of dimension M x O. It is a necessary condition that the number of columns of the 1st matrix equals the number of rows of the 2nd matrix; otherwise multiplication is not possible.

Figure 2.1: Thread Level Hierarchy

INPUT: Two matrices, say A and B, with dimensions M x N and N x O.
OUTPUT: Final matrix with dimension M x O.

2.2 Matrix proves to be advantageous in the implementation of following logics:-

1. Graph theory
2. Probability theory and statistics
3. Symmetries and transformations of physics
4. MATLAB

2.3 Sequential matrix-multiplication:

Suppose we have to multiply two matrices A and B and store the final result in matrix C. Each element of C is found by accumulating sum = sum + mat1[i][k] * mat2[k][j] over k and then storing mat3[i][j] = sum. Here r1 is the number of rows of the first matrix, c1 is the number of columns of the first matrix (equal to the number of rows of the second) and c2 is the number of columns of the second matrix.

    for (i = 0; i < r1; i = i + 1)
    {
        for (j = 0; j < c2; j = j + 1)
        {
            sum = 0;
            for (k = 0; k < c1; k++)
                sum = sum + mat1[i][k] * mat2[k][j];
            mat3[i][j] = sum;
        }
    }

2.4 Parallel matrix-multiplications on CUDA:-

As matrix multiplication has many independent stages, we can expect some speedup from a parallel architecture like CUDA.

2.4.1 Implementation:

We launch the same number of threads as the number of elements in the resultant matrix. Each thread simultaneously calculates the element at its corresponding index of the resultant matrix. Our blocks are 2D in nature and have dimension N x O (here we have taken input values such that N and O are equal). Both dimensions of the 2D grid are equal to sqrt(total number of blocks launched). Each element is indexed using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y.

    dim3 threads(My_block blocksX, My_block);
    float grid_D = sqrt(My_block);
    dim3 grid(grid_D, grid_D);

Indexing:

    int row = blockIdx.y * block_D + threadIdx.y;
    int col = blockIdx.x * block_D + threadIdx.x;
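
The one-thread-per-element mapping described above can be sketched in plain C as follows. This is a CPU emulation under stated assumptions, not the report's kernel: the function names `matmul_element` and `matmul_emulated` and the flat row-major storage are chosen here for illustration. The body of `matmul_element` corresponds to the work a single thread would do for its (row, col) pair.

```c
/* Work done by the single logical thread responsible for element (row, col):
   one dot product of a row of A (M x N) with a column of B (N x O).
   Matrices are stored row-major in flat arrays. */
float matmul_element(const float *A, const float *B, int N, int O,
                     int row, int col)
{
    float sum = 0.0f;
    for (int k = 0; k < N; k++)
        sum += A[row * N + k] * B[k * O + col];
    return sum;
}

/* CPU emulation of the launch: one logical thread per element of C (M x O).
   On the GPU these iterations are independent and run concurrently. */
void matmul_emulated(const float *A, const float *B, float *C,
                     int M, int N, int O)
{
    for (int row = 0; row < M; row++)
        for (int col = 0; col < O; col++)
            C[row * O + col] = matmul_element(A, B, N, O, row, col);
}
```

Because no element of C depends on any other, the threads need no synchronization, which is consistent with the choice of global memory noted in the salient features.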

2.5 Kernel Specifications:

A) __global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].

2.6 Salient Features:

1. We have implemented the kernel on global memory, as our threads are independent of each other and we face no synchronisation problem.
2. The motivation for using global memory was to run our code for matrices with large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

2.7 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

2.8 Observations:

1. Immediate speedUp for N > 32, due to the O(n^3) complexity of the sequential algorithm.
2. Sequential time is almost linearly proportional to the size of the resultant matrix.
3. Initially speedUp/N, i.e. the slope of the speedUp graph, is very steep.
4. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains constant; however, the overall performance of the parallel code is degraded by the time accounted for by the memory copy overhead between host and device.

2.9 Conclusions:

1. As the sequential algorithm is of order n^3, we got a decent speedup for large dimensions.
2. The parallel approach is very favourable when the sequential complexity is higher.
3. Even better speedUps can be achieved with memory optimization techniques.

Figure 2.2: execution time vs Input size

Figure 2.3: SpeedUp vs input size

Figure 2.4: SpeedUp vs input size

**Chapter 3 Implementation Of Prefix Sum Algorithm On CUDA**

3.1 Introduction

Prefix sum, also known as the partial sum of a series, is in programming terms the fold of the addition operation. The prefix sum is considered to be the simplest and most useful building block of parallel algorithms. The prefix sum can be calculated for very large sets of input data and is generally described as below:

For a set of N values {a1, a2, a3, a4, ..., an}, the prefix-sum can be calculated as {a1, (a1+a2), (a1+a2+a3), ..., (a1+a2+...+an)}.

For example: a[8] = {1,3,4,2,6,3,7,1}, prefix-sum = {1,4,8,10,16,19,26,27}

Prefix-sum proves to be advantageous in the implementation of following logics:

1. In the implementation of radix sort and quick sort.
2. Performing lexical analysis and searching for regular expressions.
3. In evaluating polynomials, solving recurrences and adding multiprecision numbers.
4. It can be very helpful in string matching algorithms.

3.2 Sequential Prefix-sum algorithm:

The sequential prefix-sum algorithm is a very simple method to calculate the prefix-sum of a given input array of numbers: it loops over the array and adds to the current value that of the previous index. The logic is demonstrated below:

    for (i = 1; i < size; i = i + 1)
        a[i] = a[i] + a[i-1];

This code performs exactly N-1 additions for an array of size N and is thus a very simple implementation.

3.3 Parallel Prefix-Sum On CUDA:

The prefix-sum algorithm can be performed very efficiently using the parallel architecture. We just need to divide the input array into blocks of proper dimension and launch the kernel.

3.3.1 Implementation-

For an input array of size N (which can be very large), a one-dimensional grid is created with N/512 blocks. If the size of the input is N < 512, then a grid with one block containing N threads is launched by the kernel function.

Each of the blocks is provided with a shared array of size 512 and its own set of shared variables. Every value of the input array, which is stored in global memory, is mapped to a specific thread ID that depends on the block index:

    ID = blockIdx.x * dim_block + threadIdx.x;

Thus, the respective elements are copied from the global memory to the shared memory of each block. The partial sums of the values in each block are generated and stored in a global array according to the respective block index.
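
The block-wise scheme above can be sketched on the CPU in two phases. This is a hedged illustration in plain C, not the report's kernels: the function names `scan_blocks` and `propagate_block_sums` and the `block` parameter (512 in the report) are assumptions. Phase 1 stands in for the per-block scan in shared memory, and phase 2 stands in for carrying the sums of earlier blocks forward through a global array.

```c
/* Phase 1: inclusive scan inside each block of `block` elements, done
   independently per block; block_sums[b] receives the total of block b. */
void scan_blocks(int *a, int n, int block, int *block_sums)
{
    for (int start = 0, b = 0; start < n; start += block, b++) {
        int end = (start + block < n) ? start + block : n;
        for (int i = start + 1; i < end; i++)
            a[i] += a[i - 1];
        block_sums[b] = a[end - 1];
    }
}

/* Phase 2: add the running total of all earlier blocks to every element
   of the later blocks. */
void propagate_block_sums(int *a, int n, int block, const int *block_sums)
{
    int offset = 0;
    for (int start = 0, b = 0; start < n; start += block, b++) {
        int end = (start + block < n) ? start + block : n;
        for (int i = start; i < end; i++)
            a[i] += offset;
        offset += block_sums[b];
    }
}
```

With the example array from the introduction and a block size of 4, phase 1 yields {1,4,8,10} and {6,9,16,17} with block sums 10 and 17, and phase 2 adds 10 to the second block to give {1,4,8,10,16,19,26,27}.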

3.4 Kernel Specifications:

1. __global__ Sum_prefix() - 6 registers, 4120+16 bytes of smem, 4 bytes of cmem[1]
2. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1]

3.5 Salient Features:

1. The use of shared memory to perform consecutive reads, which reduces the time that would have been spent performing the same reads and writes using global memory.
2. Proper synchronization between threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.
4. The code is generalised to run on a very large number of values.
5. Both the times t1 (without considering memory copy overhead) and t2 (considering memory transfer overhead) are calculated.

3.6 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

3.7 Observations:

1. For very small input sizes, the sequential prefix sum appears to be much faster than the parallel code.
2. With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains constant.
3. Very large speedups with respect to kernel execution times are achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large inputs limits the overall speedup.

3.8 Conclusions:

1. Using efficient memory optimization techniques, the memory transfer overhead between the host and the device can be reduced.
2. With better kernel optimization, the speedUp can be increased.

Figure 3.1: Preﬁx-sum algorithm

Figure 3.2: Preﬁx-sum algorithm

Figure 3.3: Preﬁx-sum algorithm

**Chapter 4 Implementation Of Bitonic Sort Algorithm On CUDA**

4.1 Introduction

Bitonic sort is a fast method for sorting a large number of values. It contains two types of operations, which are shown by a down arrow (also written as the (+) operation, a purely symbolic representation) and an up arrow (also written as the (-) operation). In the (+) operation both values are compared, and after the comparison the larger value should be at the higher index (swapping might be required for this purpose). In the (-) operation both values are compared and the larger value should be at the lower index (again, swapping may or may not be required).

INPUT: Array of N elements, say A.
OUTPUT: Sorted array of A, say sort(A), such that a(i) <= a(j) for all i < j, with i and j in 0 to n-1.

Bitonic-sort proves to be advantageous in the implementation of following logics:

1. In any application which requires sorted input, for example the binary search algorithm.
2. In forming directories and managing large data.

Figure 4.1: Sample Bitonic Sorting

4.2 Parallel Bitonic-Sort On CUDA:

The parallel bitonic-sort can be performed very efficiently using the parallel CUDA architecture. For N elements, we can divide our problem into log2(N) stages, and each stage can be further divided into sub-stages. Stage i has i sub-stages, i.e. if we have 8 elements then the total number of stages is 3: the 1st stage has 1 sub-stage, the 2nd stage has 2 sub-stages and the 3rd stage has 3 sub-stages. Each sub-stage has to do N/2 independent computations, so we can launch N/2 threads for these N/2 computations. However, the sub-stages are not independent of each other, and thus we have to ensure proper synchronization between threads; otherwise we will get incorrect results.

As in our CUDA architecture we can have at most 512 threads in a block, for values larger than 512 we have to launch multiple blocks. This requires inter-block synchronization, which we tried but could not implement, so we have computed results only up to 512 values.

We have to find whether a thread has to perform the (+) or the (-) operation. For this purpose we have used a flag variable in our kernel:

    flag = (int)(id / power(i)) % 2;

If flag has value 0 then we have to perform the (+) operation, otherwise the (-) operation. Threads in blocks are 1D in nature and are indexed using threadIdx.x:

    id = threadIdx.x;

For synchronisation of threads of the same block we have used the standard library function __syncthreads().
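
The stage and sub-stage structure described above can be sketched in plain C as a CPU emulation of the N/2 threads. This is an illustration under stated assumptions rather than the report's kernel: the function names are hypothetical, and `half` plays the role of `power(i)`, which is assumed here to evaluate to 2^(i-1) for stage i. The innermost loop stands in for the concurrently running threads, and the end of each sub-stage loop stands in for the barrier that __syncthreads() provides on the device.

```c
/* Compare-exchange done by logical thread `id` in one sub-stage.
   `half` = 2^(stage-1) selects the direction; `dist` is the distance
   between the two elements compared in this sub-stage. */
void bitonic_step(int *a, int id, int half, int dist)
{
    int lo = 2 * dist * (id / dist) + (id % dist); /* lower index of pair */
    int hi = lo + dist;                            /* its partner         */
    int flag = (id / half) % 2;                    /* 0: (+), 1: (-)      */
    int out_of_order = flag ? (a[lo] < a[hi]) : (a[lo] > a[hi]);
    if (out_of_order) {
        int t = a[lo];
        a[lo] = a[hi];
        a[hi] = t;
    }
}

/* Sorts a[0..n-1] in ascending order; n must be a power of two. */
void bitonic_sort(int *a, int n)
{
    for (int half = 1; half < n; half *= 2)          /* log2(n) stages   */
        for (int dist = half; dist >= 1; dist /= 2)  /* sub-stages       */
            for (int id = 0; id < n / 2; id++)       /* n/2 "threads"    */
                bitonic_step(a, id, half, dist);
}
```

In the last stage `half` equals n/2, so `flag` is 0 for every thread and only (+) operations are performed, which agrees with the salient features listed for this chapter.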

Figure 4.2: Kernel Used in Bitonic Sorting

4.3 Salient Features:

1. Different sub-stages at the same stage level are not independent.
2. In the last stage we only have to perform (+) operations.

4.4 Limitations:

1. We have assumed that the number of input values is a power of 2, like 4, 8, 16, 32, 64, 128, 256, 512.
2. As we have only used a single 1D block, at most 512 values can be sorted.
3. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

4.5 Observations:

1. SpeedUp is gained for N > 256.
2. For the sequential code, the time increases nearly linearly with increasing N.
3. There is a very sharp increase in speedUp after N = 256.

4.6 Conclusions:

1. The speedup decreases significantly when the memory overhead is taken into account.
2. Much higher speedUps can be achieved with multiple blocks.

Figure 4.3: Execution time vs input size

Figure 4.4: slope of speedUp vs input size

Figure 4.5: speedUp vs input size

**Chapter 5 Implementation of Odd Even transposition Sort**

5.1 Introduction

The odd-even transposition sort network for n input data consists of n comparing stages. In each stage, either all inputs at odd index positions or all inputs at even index positions are compared with their next element. Odd stages are followed by even stages, and an even stage can start only after the completion of an odd stage, and vice versa. It is similar to bubble sort except that odd-even transposition sort compares disjoint pairs by using alternating odd and even index values during different phases of the sort.

5.2 The odd even merge sort is advantageous as it can

1. Be used for sorting on 2-D processor arrays.
2. Be implemented in parallel, which can achieve speedups of more than 2.0 even on a fairly small number of elements.

5.3 Sequential Odd-Even Merge Sort:

The algorithm is simple to implement and is closely related to bubble sort. In the first phase of the odd-even exchange, control visits all the even indices and compares each with its neighbouring element. In the second phase, control visits the odd indices and compares each with its neighbouring element. These pairs of phases continue until the array is sorted. Thus, the number of pairs of phases is half the number of elements in the array to be sorted (rounded up when the number of elements is odd). The looping logic is as follows:

    for (i = 0; i < (n + 1) / 2; i = i + 1)
    {
        /* even phase: compare (0,1), (2,3), ... */
        for (j = 0; j + 1 < n; j = j + 2)
            if (A[j] > A[j+1])
            {
                int T = A[j];
                A[j] = A[j+1];
                A[j+1] = T;
            }
        /* odd phase: compare (1,2), (3,4), ... */
        for (j = 1; j + 1 < n; j = j + 2)
            if (A[j] > A[j+1])
            {
                int T = A[j];
                A[j] = A[j+1];
                A[j+1] = T;
            }
    }

5.4 Parallel Odd Even Transposition Sort:

The odd-even transposition sort on the CUDA architecture is implemented on a single block with a maximum size of 512 elements. Each thread processes one element; hence even threads process even indexed elements and odd threads process odd indexed elements.

5.4.1 Implementation

For an input of size N, a block with N threads is created and each thread processes one element. The kernel creates a shared memory portion for the block and copies the array into it. Every value of the input array, which is stored in global memory, is mapped to a specific thread ID that depends on the block index:

    ID = blockIdx.x * dim_block + threadIdx.x

Thus, the respective elements are copied from the global memory to the shared memory of the block. The kernel then sorts the array in alternating odd and even phases, and the result is copied back to the host memory.
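
The per-thread view of the phases described above can be sketched in plain C. This is a CPU emulation under stated assumptions, not the report's kernel: the function names `odd_even_step` and `odd_even_sort_emulated` are hypothetical. The inner loop stands in for the N threads of the single block, and the end of each inner loop stands in for the barrier that keeps one phase from starting before the previous one has finished.

```c
/* Work of logical thread `tid` in phase `phase`: threads whose parity
   matches the phase compare their element with the right-hand neighbour
   and swap if the pair is out of order. */
void odd_even_step(int *a, int n, int tid, int phase)
{
    if (tid % 2 == phase % 2 && tid + 1 < n && a[tid] > a[tid + 1]) {
        int t = a[tid];
        a[tid] = a[tid + 1];
        a[tid + 1] = t;
    }
}

/* CPU emulation of the single-block kernel: n phases, n logical threads. */
void odd_even_sort_emulated(int *a, int n)
{
    for (int phase = 0; phase < n; phase++)   /* alternating even/odd phases */
        for (int tid = 0; tid < n; tid++)     /* one iteration per thread    */
            odd_even_step(a, n, tid, phase);
}
```

Within a phase the compared pairs are disjoint, so the threads can run in any order or all at once; only the boundary between phases needs synchronization.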

5.5 Kernel Specification:

__global__ Sort() - 8 registers, 2068+16 bytes of smem, 4 bytes of cmem[1].

5.6 Salient Features:-

1. Shared memory is used for the consecutive reads and writes, which reduces the time that would otherwise be spent performing the same accesses on global memory.
2. Proper synchronization is performed between the threads operating in parallel inside a block.
3. It was difficult to perform synchronization between different blocks, so the sums of previous blocks were propagated to the consecutive blocks using a global array.
4. Both the time t1 (without memory-copy overhead) and the time t2 (including memory-transfer overhead) are measured.
5. Synchronization ensures that, during parallel execution of the threads, the even phase always follows the odd phase.


5.7 Limitations:

1. The maximum array size is 512, limited by the maximum number of threads in a block.
2. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

5.8 Observations:

1. The speed-up increases steeply as N increases.
2. Because N is limited to 512, the memory-overhead time is smaller than the calculation time, so memory overhead has little effect on the performance graph.

5.9 Conclusions:

1. Owing to the computational complexity of the sequential approach, the parallel approach gains a recognizable speed-up.
2. Because N is limited to 512, the memory-overhead time is smaller than the calculation time, so memory overhead has little effect on the performance graph.

Figure 5.1: Execution time vs input size


Figure 5.2: speedUp vs input size

**Chapter 6 Implementation Of Parallel Quicksort By Regular Sampling Algorithm On CUDA**

6.1 Introduction

Quicksort (also known as partition-exchange sort) is a very well-known sorting algorithm developed by C. A. R. Hoare. It is a comparison sort and, in efficient implementations, is not a stable sort. Quicksort tends to make excellent use of the memory hierarchy, taking advantage of virtual memory and the available caches. It is well suited to modern computer architectures, as it needs no temporary array and is thus an in-place sort.

6.2 Sequential Quicksort:

The sequential implementation of the quicksort algorithm follows a divide-and-conquer approach to sort a large input array of values. The procedure involves:

1. Selecting one of the numbers from the input as the pivot element (any number may be selected, for example at random).
2. Locating the index (position) of the pivot in the input array and then dividing the array into two sub-arrays: the lower sub-array contains the elements with values smaller than the pivot, and the upper sub-array contains the elements with values higher than the pivot.
3. Applying steps 1 and 2 recursively to both the lower and the upper sub-arrays.
4. Finally, a sorted list of values is obtained (sorted here in ascending order).

ILLUSTRATION OF QUICKSORT

Figure 6.1: Sequential Quicksort algorithm

Quicksort is generally regarded as one of the fastest comparison-based sorting algorithms in the average case. It also has some natural concurrency, since the lower and upper sub-arrays can be sorted concurrently.
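The sequential procedure can be sketched in C as follows. This is an illustrative version, not the code measured in this report: it assumes the last element of each range as the pivot (the text allows any choice) and uses the Lomuto partition scheme; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static void swap_int(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Lomuto partition of a[lo..hi] (hi inclusive) around the pivot a[hi].
 * Returns the final index of the pivot. */
static size_t partition(int *a, size_t lo, size_t hi)
{
    int pivot = a[hi];
    size_t i = lo;
    for (size_t j = lo; j < hi; ++j) {
        if (a[j] < pivot) {
            swap_int(&a[i], &a[j]);
            ++i;
        }
    }
    swap_int(&a[i], &a[hi]);
    return i;
}

/* Recursively sorts the lower and the upper sub-array. */
static void quicksort_range(int *a, size_t lo, size_t hi)
{
    if (lo >= hi) return;
    size_t p = partition(a, lo, hi);
    if (p > lo) quicksort_range(a, lo, p - 1);  /* lower sub-array */
    quicksort_range(a, p + 1, hi);              /* upper sub-array */
}

/* Sorts a[0..n-1] in ascending order. */
void quicksort(int *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n > 1) quicksort_range(a, 0, n - 1);
}
```

The two recursive calls work on disjoint ranges, which is the natural concurrency mentioned above.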

6.3 Parallel Quicksort Using Regular Sampling:

Parallel quicksort using regular sampling can be applied to very large sets of data. It involves segmenting the unsorted list of n elements into p blocks, with the elements evenly distributed among the blocks. There are four phases in all:

1. Each block sorts its own segment and selects the data items at local indices $0, \frac{n}{p^2}, \frac{2n}{p^2}, \ldots, \frac{(p-1)n}{p^2}$ as a regular sample of its locally sorted block.
2. All the selected samples are then sorted again, and $p-1$ pivots are selected from them and broadcast to every block.
3. Each block then partitions its sorted sub-array into $p$ disjoint partitions.
4. Block $i$ keeps its $i$-th partition and sends its $j$-th partition to block $j$, for all $j \neq i$; each block then merges its $p$ partitions, and the results form a single global array.
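The sampling step of the first phase can be illustrated with a short C sketch. It assumes that n is the total input size, that p is the number of blocks, and that p*p divides n; the function name and signature are hypothetical and not taken from the report's kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the p regular samples of one locally sorted block into out[0..p-1].
 * The block holds n / p elements and is sampled at the local indices
 * 0, n/p^2, 2n/p^2, ..., (p-1)n/p^2. Assumes p*p divides n. */
void regular_samples(const int *sorted_block, size_t n, size_t p, int *out)
{
    assert(p > 0 && n > 0 && n % (p * p) == 0);
    size_t step = n / (p * p);
    for (size_t k = 0; k < p; ++k)
        out[k] = sorted_block[k * step];
}
```

For example, with n = 18 and p = 3 each block holds 6 elements and is sampled at local indices 0, 2 and 4.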


6.3.1 Implementation:

1. The unsorted input list is divided into $N = \frac{\text{size}}{512}$ blocks, and the unsorted partitions are then copied from the global array to the shared array of each block on the GPU.
2. Every block sorts the segment stored in its shared array independently of the other blocks.
3. Local pivots are selected and copied to a global array, indexed according to the block ID.
4. The list of pivots is then sorted again, and P-1 pivots are selected and broadcast to every block.
5. The locally sorted arrays are partitioned according to the pivots, and the partitions are then merged into a global array accordingly.
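Step 5 relies on splitting a locally sorted array at the broadcast pivots. A minimal C sketch of that split is given below; the function name, the boundary-array representation and the choice to send elements equal to a pivot to the lower partition are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* For a locally sorted block and p-1 sorted pivots, writes p+1 boundaries
 * into bounds so that partition k is block[bounds[k] .. bounds[k+1]-1].
 * Partition k collects the remaining elements that are <= pivots[k]; the
 * last partition collects everything larger than the last pivot. */
void partition_by_pivots(const int *block, size_t len,
                         const int *pivots, size_t p, size_t *bounds)
{
    assert(p >= 1);
    size_t i = 0;
    bounds[0] = 0;
    for (size_t k = 0; k + 1 < p; ++k) {
        while (i < len && block[i] <= pivots[k])
            ++i;
        bounds[k + 1] = i;
    }
    bounds[p] = len;
}
```

Because the block is already sorted, a single left-to-right scan finds all boundaries, and each partition is a contiguous range that can be copied as a unit.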

6.4 Kernel Specifications:

1. kernel1 - 6 registers, 6810+16 bytes smem, 4 bytes cmem
2. kernel2 - 8 registers, 24+16 bytes smem, 4 bytes cmem
3. kernel3 - 7 registers, 2084+16 bytes smem, 8 bytes cmem

6.5 Salient features:

1. Shared memory is used for the consecutive reads and writes, which reduces the time that would otherwise be spent performing the same accesses on global memory.
2. The code is generalised to run on a very large number of values.
3. Better load balance.
4. Repeated communication of the same value is avoided.
5. Three kernel functions are used to increase the extent of parallelization while continuously using shared memory.


6.6 Limitations:

1. The input size is limited to multiples of 512.
2. The sorting of the segmented array at block level is implemented using a single thread, which affects the overall efficiency and reduces parallelism.
3. Better load balance
4. Global memory is used constantly for broadcasting the pivots and sorting them globally.
5. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

6.7 Observations:

1. The sequential code is recursive and highly efficient.
2. The use of three kernels drastically increases the execution time.

6.8 Conclusions:

1. Efficient sequential codes can outperform the parallel versions.


Figure 6.2: execution time vs input size

Figure 6.3: speedUp vs input size


Figure 6.4: speedUp vs input size

**Chapter 7 Implementation of matrix transpose algorithm on CUDA**

7.1 Introduction

Matrix transpose is an operation in which the rows of a matrix are exchanged with the corresponding columns, i.e. the values in the 1st row become the values of the 1st column. The transpose is defined for any matrix, but this implementation handles only square matrices, i.e. both dimensions of the matrix must be the same. The matrix transpose can be calculated for very large sets of input data and is generally described as below:

INPUT: Matrix A of dimension N*N.
OUTPUT: Matrix transpose(A) with the same dimensions. The 1st row of A must match the 1st column of transpose(A), and so on.

Example:

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \operatorname{transpose}(A) = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix}$$


7.2 Matrix transpose proves to be advantageous in the implementation of following logics:

1. Finding the inverse of a matrix.
2. Applications involving orthogonal matrices.

7.3 Sequential matrix transpose:

The sequential logic is straightforward: since the rows and columns are exchanged, we simply swap the two indices, i.e. transpose(A)[j][i] = A[i][j]. The program is indexed to follow this logic, with T holding transpose(A):

    for (i = 0; i < r1; i = i + 1)
    {
        for (j = 0; j < c1; j = j + 1)
        {
            T[j][i] = A[i][j];
        }
    }

Here r1 is the number of rows of matrix A and c1 is the number of columns; both must be equal, as A is a square matrix.

7.4 Parallel matrix transpose:

As matrix A and transpose(A) are stored separately, every element can be written independently. We can therefore launch as many threads as there are elements, and the threads do not even have to be synchronized.

7.4.1 Implementation:

For an input matrix of dimension N*N (N can be very large), a 2-D grid is created. If N*N < 512, a grid with one block containing N*N threads is launched for the kernel function. If N*N > 512, the number of blocks launched is $\frac{N*N}{256}$, each a 2-D block of dimension 16*16. Each element is indexed using threadIdx.x, threadIdx.y, blockIdx.x and blockIdx.y.

Indexing:

    int row = blockIdx.y * block_D + threadIdx.y;
    int col = blockIdx.x * block_D + threadIdx.x;
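The mapping from block and thread indices to matrix elements can be checked on the host. The C sketch below is only an imitation of a kernel launch: the four loops enumerate every (block, thread) pair and the loop body is what a single thread would do. The name transpose_grid_sim, the row-major storage and the bounds guard for partially filled blocks are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_D 16  /* threads per block along each dimension */

/* Host-side imitation of a 2-D grid of 16*16 blocks transposing an
 * n*n row-major matrix from in to out. */
void transpose_grid_sim(const float *in, float *out, size_t n)
{
    assert(in != out);  /* out-of-place, so threads never conflict */
    size_t blocks = (n + BLOCK_D - 1) / BLOCK_D;  /* blocks per grid dimension */
    for (size_t by = 0; by < blocks; ++by)
        for (size_t bx = 0; bx < blocks; ++bx)
            for (size_t ty = 0; ty < BLOCK_D; ++ty)
                for (size_t tx = 0; tx < BLOCK_D; ++tx) {
                    size_t row = by * BLOCK_D + ty;
                    size_t col = bx * BLOCK_D + tx;
                    if (row < n && col < n)  /* guard for partial blocks */
                        out[col * n + row] = in[row * n + col];
                }
}
```

Every (row, col) pair is produced exactly once, so each output element has a single writer and no synchronization between threads is required.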

7.5 Kernel specifications:

__global__ void matrixMul_globalmemory - 9 registers, 28+16 bytes of smem, 4 bytes of cmem[1].

7.6 Salient features:

1. The implementation uses global memory, as the threads are independent of each other and no synchronisation problem arises.
2. The motivation for using global memory was to run the code for matrices with large dimensions.
3. The code is generalised to run on a very large number of values.
4. Both the time t1 (without memory-copy overhead) and the time t2 (including memory-transfer overhead) are measured.

7.7 Limitations:

1. For large arrays (>512 values), the input size was limited to multiples of 512.
2. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

7.8 Observations:

1. As N increases, the ratio $\frac{\text{calculation time}}{\text{memory overhead}}$ decreases significantly. This is due to the simple calculative logic.
2. Due to memory overhead, the speed-up did not increase beyond 0.91.


7.9 Conclusions:

1. A speed-up in the calculations (CPU vs GPU) is easily achieved.
2. Better memory optimizations could yield a significant speed-up.

Figure 7.1: Execution time vs input size

Figure 7.2: speedUp vs input size


Figure 7.3: speedUp vs input size

**Chapter 8 Implementation of parallel sum algorithm on CUDA**

8.1 Introduction

Parallel sum is the program that finds the sum of all the elements present in an array. The parallel sum can be calculated for very large sets of input data and is generally described as below:

INPUT: A set of N values [a1, a2, a3, ..., an-1, an].
OUTPUT: The final sum of the array, SUM = a1 + a2 + a3 + ... + an-1 + an.

For example: a[8] = {1,3,4,2,6,3,7,1}, SUM = 1+3+4+2+6+3+7+1 = 27

8.2 Parallel-sum proves to be advantageous in the implementation of following logics:

1. Finding the mean of a set of values.
2. Finding the variance.


8.3 Sequential Sum Algorithm:-

The sequential sum algorithm is a very simple method of calculating the total sum of a given input array of numbers: it loops over the array and adds the current value to the variable SUM. The logic is demonstrated below:

    SUM = 0;
    for (i = 0; i < size; i = i + 1)
        SUM = a[i] + SUM;

This code performs exactly N additions for an array of size N and is thus a very simple implementation.

8.4 Parallel Sum:

The sum can be computed very efficiently on the parallel architecture. We assume that the size of the input array is a power of two, i.e. 2, 4, 8, 16, 32, ..., 1024, ..., 8192 and so on.

8.4.1 Implementation:

For an input array of size N (which can be very large), a one-dimensional grid is created with $\frac{N}{512}$ blocks. If the size of the input is N<512, then a grid with one block containing N threads is launched for the kernel function.

In the kernel function, each thread adds two elements and stores the sum at the lower of the two indices. For example, if the input array is A={1,2,3,4,5,6,7,8}, then in the first run we create 4 threads for the 8 values:

The first thread (threadIdx=0) computes a[0]=a[0]+a[1]=1+2=3 and stores it at the lower index, i.e. 0.
The second thread (threadIdx=1) computes a[2]=a[2]+a[3]=3+4=7.
The third thread (threadIdx=2) computes a[4]=a[4]+a[5]=5+6=11.
The fourth thread (threadIdx=3) computes a[6]=a[6]+a[7]=7+8=15.

The number of values has now been reduced from 8 to 4, so the next run requires only 2 threads instead of 4. The active threads are selected using their thread IDs.

Condition:

    if ((int)(threadIdx.x) - power(j) >= 0)

Here j denotes the run number, i.e. it is 0 for the 1st run, 1 for the second run and so on. Each run halves the number of values, so computing the sum of N values needs $\log_2 N$ runs.

Each block is provided with a shared array of size 512 and its own set of shared variables. Every value of the input array, which is stored in global memory, is mapped to a specific thread through an ID that depends on the block index:

    ID = blockIdx.x * dim_block + threadIdx.x;

Proper synchronisation must be ensured between the different runs of the threads. We have used the standard CUDA function __syncthreads().
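The pairwise pattern of the worked example can be reproduced on the host with a few lines of C. This is a sketch of the reduction idea only, not the report's kernel: the outer loop stands for the successive runs, the inner loop for the threads active in a run, and the name tree_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a[0..n-1] in place by pairwise addition; n must be a power of two.
 * Each pass of the outer loop is one run; each iteration of the inner loop
 * is the work of one thread. The total ends up in a[0]. */
int tree_sum(int *a, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];  /* store the sum at the lower index */
        /* a GPU kernel would call __syncthreads() between runs */
    }
    return a[0];
}
```

For A={1,2,3,4,5,6,7,8} the first run leaves 3, 7, 11 and 15 at indices 0, 2, 4 and 6, as in the example above, and after three runs a[0] holds 36.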

8.5 Kernel Specification:

1. __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].

8.6 Salient Features:-

1. Shared memory is used for the consecutive reads and writes, which reduces the time that would otherwise be spent performing the same accesses on global memory.
2. Proper synchronization is performed between the threads operating in parallel inside a block.
3. The code is generalised to run on a very large number of values.
4. Both the time t1 (without memory-copy overhead) and the time t2 (including memory-transfer overhead) are measured.

8.7 Limitations:-

1. We have assumed that the number of input values is a power of 2.
2. The code can run for large inputs until the maximum number of blocks is reached. A one-dimensional grid on this hardware can hold at most 65535 blocks, so the parallel sum can be computed for an array of up to 65535*512=33553920 elements.
3. A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

8.8 Observations:

1. For very small input sizes, the sequential sum appears to be much faster than the parallel code.
2. Good speed-up with respect to kernel execution times is achieved, which demonstrates the efficiency of running the parallel code on CUDA.

8.9 Conclusions:

(a) The use of shared memory requires careful synchronization logic.
(b) Bank conflicts are very common due to unrestricted access to shared memory.

Figure 8.1: Execution time vs input size


Figure 8.2: speedUp vs input size

Figure 8.3: speedUp vs input size

**Chapter 9 Calculation Of Variance and Standard Deviations on CUDA**

9.1 Introduction

The mean of a data set is simply the arithmetic average of the values in the set, obtained by summing the values and dividing by the number of values. The mean is a measure of the center of the distribution. The variance is a measure of how far a set of numbers is spread out, i.e. how far the numbers lie from their mean. The variance of a data set is the arithmetic average of the squared differences between the values and the mean. The standard deviation gives a measure of how much variation or dispersion there is from the mean; mathematically, it is the square root of the variance. The variance and the standard deviation are both measures of the spread of the distribution about the mean.
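In symbols, the three quantities described above can be written as follows. The notation ($a_i$ for the data values, $N$ for their count, $\mu$ for the mean and $\sigma$ for the standard deviation) is assumed here for illustration; the variance shown is the population form (division by $N$), which matches the verbal definition above.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} a_i, \qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (a_i - \mu)^2, \qquad
\sigma = \sqrt{\sigma^2}
```

For instance, for the eight values 2, 4, 4, 4, 5, 5, 7, 9 the mean is $40/8 = 5$, the squared differences sum to $9+1+1+1+0+0+4+16 = 32$, so the variance is $32/8 = 4$ and the standard deviation is 2.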

9.2 Finding VARIANCE AND DEVIATION proves to be advantageous when:

(a) The spread of the data around the mean is to be found.
(b) A large data set is to be analyzed on the basis of the extent of the spread in the data.
(c) For example, the margin of error in polling data is determined by calculating the standard deviation of the results that would be obtained if the poll were conducted multiple times.

9.3 Sequentially Calculate Variance and SD:

The sum is easily calculated by adding up the elements of the N-sized array, and the mean is found by dividing this sum by N:

    sum = 0;
    for (i = 0; i < n; i = i + 1) { sum = sum + A[i]; }
    avrg = sum / n;

The variance is then calculated from the deviations from this mean value, following the definition given in the introduction. The looping would be:

    sum1 = 0;
    for (i = 0; i < n; i++) { sum1 += (A[i] - avrg) * (A[i] - avrg); }
    var = sum1 / n;
    SD = sqrt(var);

SD is the standard deviation, which is the square root of the variance.

9.4 Parallely Calculate Variance and SD:

The process of finding the sum in parallel on CUDA is a complex one due to synchronization problems. The sum is calculated using the kernel described in Chapter 8. Dividing the sum by N gives the average, which is used by the 2nd kernel for the calculation of the variance and SD.

9.4.1 Implementation:

For an input array of size N (which can be very large), a one-dimensional grid is created with $\frac{N}{512}$ blocks. If the size of the input is N<512, then a grid with one block containing N threads is launched for the kernel function. Each block is provided with a shared array of size 512 and its own set of shared variables. Every value of the input array, which is stored in global memory, is mapped to a specific thread through an ID that depends on the block index:

    ID = blockIdx.x * dim_block + threadIdx.x;

Thus, the respective elements are copied from global memory to the shared memory of each block. The average calculated by kernel 1 is passed on to kernel 2, which calculates the contribution of each block to the variance and stores it in an array. The summation of this array gives the variance of the data, and the square root of the variance gives the SD.
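The block-wise structure of the second kernel can be imitated on the host as shown below. This is a hedged C sketch, not the kernel itself: it assumes that each block contributes the sum of squared deviations of its own 512 elements and that the division by N happens once at the end. The function names and the use of double precision are illustrative choices (the GPU used in this report works in single precision).

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 512  /* elements handled per block, as in the text */

/* Partial result of one block: the sum of squared deviations from the
 * mean over the elements a[start .. end-1]. */
static double block_sq_dev(const double *a, size_t start, size_t end, double mean)
{
    double s = 0.0;
    for (size_t i = start; i < end; ++i)
        s += (a[i] - mean) * (a[i] - mean);
    return s;
}

/* Combines the per-block partial sums into the population variance.
 * The SD would then be obtained as sqrt() of the returned value. */
double blockwise_variance(const double *a, size_t n, double mean)
{
    assert(n > 0);
    double total = 0.0;
    for (size_t start = 0; start < n; start += BLOCK_SIZE) {
        size_t end = start + BLOCK_SIZE < n ? start + BLOCK_SIZE : n;
        total += block_sq_dev(a, start, end, mean);  /* one block's entry */
    }
    return total / (double)n;
}
```

The per-block sums are independent of each other, so only the final accumulation has to be serialized.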

9.5 Kernel Specification:

(a) __global__ void sum() - 5 registers, 2076+16 bytes of smem, 8 bytes of cmem[1].

9.6 Limitations:

(a) For large arrays (>512 values), the input size was limited to multiples of 512.
(b) A GPU occupancy of 67% was achieved, as calculated by the GPU Occupancy Calculator.

9.7 Observations:

(a) For very small input sizes, the sequential code appears to be much faster than the parallel code.
(b) With the increase in the size of the input, the time taken by the sequential code increases almost linearly, whereas the time taken by the kernel to execute remains nearly constant; however, the overall performance of the parallel code is degraded by the memory-copy overhead between host and device.
(c) A very large speed-up with respect to kernel execution times is achieved, which demonstrates the efficiency of running the parallel code on CUDA, but the memory overhead for large inputs limits the overall speed-up.


9.8 Conclusions:

(a) Finding the mean, variance and SD sequentially is O(n). Hence no speed-up is achieved, as the kernel for finding the sum has synchronization requirements to be met.
(b) Memory optimization techniques can be used to control the synchronization of shared memory; a speed-up may then be achieved, but it is not guaranteed.

Figure 9.1: Execution time vs input size

Figure 9.2: speedUp vs input size


Figure 9.3: speedUp vs input size

**Chapter 10 Data of Algorithms**

The CPU we used has the following specifications:

| Processor | Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz |
|---|---|
| Memory | 1GB DDR2 RAM |
| L2 Cache | 4 MB |

The Nvidia Quadro FX 1700 GPGPU we used has the following specifications:

| CUDA Parallel Processor Cores | 32 |
|---|---|
| Memory Size | 512 MB |
| Memory Interface | 128-bit |
| Graphics Memory Bandwidth | 12.8 GB/sec |

The graphics card used for our experiment (Quadro FX 1700) is of compute capability 1.1. This version does not support double-precision floating point. Also, the mathematical functions used have limited accuracy. This leads to a mild loss of accuracy in the final results.

| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 4 | 1 | 43 | 67 | 21 | 2 |
| 8 | 554 | 924 | 1009 | 0.17 | 0.15 |
| 16 | 2360 | 2118 | 2250 | 0.60 | 0.55 |
| 32 | 9414 | 8160 | 8405 | 1.11 | 1.05 |
| 64 | 37784 | 32041 | 32486 | 1.18 | 1.01 |
| 128 | 131292 | 133058 | 133952 | 0.99 | 0.98 |
| 256 | 538807 | 526462 | 528415 | 1.02 | 1.02 |
| 512 | 2378744 | 2118810 | 2122760 | 1.12 | 1.12 |
| 1024 | 11560038 | 8538991 | 8547882 | 1.35 | 1.35 |
| 2048 | 52087845 | 34331100 | 34357273 | 1.52 | 1.52 |

Table 10.1: Matrix Multiplication (time in $10^{-6}$ s)

| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 4 | 1 | 51 | 190 | 0.02 | 0.01 |
| 8 | 2 | 61 | 200 | 0.03 | 0.01 |
| 16 | 6 | 77 | 226 | 0.08 | 0.03 |
| 32 | 13 | 94 | 243 | 0.14 | 0.05 |
| 64 | 33 | 120 | 280 | 0.28 | 0.12 |
| 128 | 77 | 147 | 297 | 0.52 | 0.26 |
| 256 | 179 | 182 | 332 | 0.98 | 0.54 |
| 512 | 423 | 251 | 402 | 1.69 | 1.05 |

Table 10.2: Bitonic Sort Algorithm (time in $10^{-6}$ s)

| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 16 | 1 | 76 | 97 | 0.01 | 0.01 |
| 32 | 1 | 68 | 93 | 0.01 | 0.01 |
| 64 | 2 | 75 | 98 | 0.03 | 0.02 |
| 128 | 3 | 81 | 105 | 0.04 | 0.03 |
| 256 | 4 | 98 | 123 | 0.04 | 0.03 |
| 512 | 7 | 146 | 172 | 0.05 | 0.04 |
| 1024 | 14 | 151 | 179 | 0.09 | 0.08 |
| 2048 | 28 | 151 | 179 | 0.19 | 0.16 |
| 4096 | 54 | 250 | 301 | 0.22 | 0.18 |
| 8192 | 108 | 467 | 553 | 0.23 | 0.20 |
| 16384 | 215 | 934 | 1075 | 0.23 | 0.20 |
| 32768 | 430 | 1958 | 2266 | 0.22 | 0.19 |
| 65536 | 858 | 4503 | 5087 | 0.19 | 0.17 |
| 262144 | 2956 | 33192 | 35403 | 0.09 | 0.08 |
| 524288 | 5922 | 107130 | 111562 | 0.06 | 0.05 |

Table 10.3: Prefix Sum (time in $10^{-6}$ s)


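| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 4 | 1 | 43 | 67 | 0.02 | 0.01 |
| 8 | 1 | 47 | 73 | 0.02 | 0.01 |
| 16 | 3 | 50 | 73 | 0.06 | 0.04 |
| 32 | 9 | 58 | 83 | 0.16 | 0.11 |
| 64 | 32 | 78 | 105 | 0.41 | 0.30 |
| 128 | 113 | 144 | 67 | 0.96 | 0.78 |
| 256 | 446 | 282 | 67 | 1.75 | 1.58 |
| 512 | 1786 | 838 | 67 | 2.21 | 2.13 |

Table 10.4: Odd-Even Transposition Sort (time in $10^{-6}$ s)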

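| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 4 | 1 | 63 | 87 | 0.02 | 0.01 |
| 16 | 2 | 177 | 201 | 0.01 | 0.01 |
| 32 | 3 | 549 | 574 | 0.01 | 0.01 |
| 64 | 7 | 1023 | 1030 | 0.01 | 0.01 |
| 256 | 32 | 2000 | 2018 | 0.02 | 0.02 |
| 512 | 68 | 2608 | 2646 | 0.03 | 0.03 |
| 1024 | 144 | 4608 | 4698 | 0.03 | 0.03 |
| 2048 | 290 | 7500 | 7568 | 0.04 | 0.04 |
| 8192 | 1252 | 17062 | 17124 | 0.07 | 0.07 |
| 32768 | 5392 | 29865 | 29936 | 0.18 | 0.18 |
| 131072 | 23079 | 92452 | 92498 | 0.25 | 0.25 |

Table 10.5: Quicksort (time in $10^{-6}$ s)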

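| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 4 | 1 | 45 | 71 | 0.02 | 0.01 |
| 8 | 1 | 45 | 72 | 0.02 | 0.01 |
| 16 | 3 | 46 | 74 | 0.07 | 0.04 |
| 32 | 8 | 47 | 79 | 0.17 | 0.10 |
| 64 | 33 | 58 | 113 | 0.57 | 0.29 |
| 128 | 127 | 147 | 298 | 0.86 | 0.43 |
| 256 | 458 | 411 | 1045 | 1.11 | 0.44 |
| 512 | 2057 | 1528 | 3748 | 1.35 | 0.55 |
| 1024 | 9906 | 6076 | 13738 | 1.36 | 0.72 |
| 2048 | 45757 | 24740 | 50454 | 1.85 | 0.91 |
| 4096 | 202233 | 206395 | 307262 | 0.98 | 0.82 |

Table 10.6: Matrix-transpose (time in $10^{-6}$ s)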


| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 16 | 1 | 53 | 80 | 0.02 | 0.01 |
| 64 | 1 | 56 | 79 | 0.02 | 0.01 |
| 256 | 2 | 70 | 98 | 0.03 | 0.02 |
| 512 | 3 | 87 | 113 | 0.03 | 0.03 |
| 1024 | 6 | 89 | 117 | 0.07 | 0.05 |
| 4096 | 22 | 136 | 190 | 0.16 | 0.12 |
| 8192 | 41 | 227 | 312 | 0.18 | 0.13 |
| 16384 | 83 | 418 | 572 | 0.20 | 0.15 |
| 32768 | 168 | 768 | 1102 | 0.22 | 0.15 |
| 262144 | 1155 | 5829 | 8045 | 0.20 | 0.14 |
| 1048576 | 4650 | 23189 | 30856 | 0.20 | 0.15 |
| 4194304 | 18459 | 92547 | 118429 | 0.20 | 0.16 |
| 16777216 | 73597 | 369930 | 470787 | 0.20 | 0.16 |

Table 10.7: Summation Algorithm (time in $10^{-6}$ s)

| Input | SeqEx-time | PEx-time1 | PEx-time2 | Speed-up1 | Speed-up2 |
|---|---|---|---|---|---|
| 512 | 7 | 272 | 294 | 0.03 | 0.02 |
| 1024 | 13 | 273 | 297 | 0.05 | 0.04 |
| 2048 | 25 | 281 | 311 | 0.09 | 0.08 |
| 4096 | 49 | 380 | 418 | 0.13 | 0.12 |
| 8192 | 98 | 582 | 639 | 0.17 | 0.15 |
| 16384 | 166 | 997 | 1090 | 0.17 | 0.15 |
| 32768 | 384 | 1767 | 1976 | 0.22 | 0.19 |
| 65536 | 767 | 3415 | 3813 | 0.22 | 0.20 |
| 131072 | 1323 | 6666 | 7427 | 0.20 | 0.18 |
| 262144 | 2639 | 13148 | 14643 | 0.20 | 0.18 |
| 1048576 | 10651 | 51131 | 55262 | 0.21 | 0.19 |
| 4194304 | 42947 | 200280 | 214462 | 0.21 | 0.20 |
| 16777216 | 171841 | 799691 | 854827 | 0.21 | 0.20 |

Table 10.8: Variance and SD (time in $10^{-6}$ s)
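The speed-up columns in most rows of these tables are consistent with the ratio of the sequential time to the corresponding parallel time. A minimal C sketch of that calculation follows; the function name is hypothetical, and the reading of the two parallel-time columns as "without" and "with" memory-transfer overhead is an inference from the t1 and t2 times described in the earlier chapters.

```c
#include <assert.h>

/* Speed-up of the parallel run relative to the sequential one.
 * Both times must be in the same unit (microseconds in the tables). */
double speed_up(double seq_time, double par_time)
{
    assert(par_time > 0.0);
    return seq_time / par_time;
}
```

For the 512-element row of the bitonic sort table, speed_up(423, 251) is about 1.69 and speed_up(423, 402) is about 1.05, matching the tabulated values. A value below 1 means that the sequential code was faster.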

Bibliography
