GPU Accelerated Linear Algebra

Optimising performance of Matrix-multiplication, LU- and QR-decomposition using Cuda.

Published by: Mikkel Bundgaard-Ovesen on Aug 30, 2011
Copyright: Attribution Non-commercial


05/23/2013


Table of contents

Table of contents
List of figures
List of tables
Summary
1 Introduction
  1.1 Motivation
  1.2 Reading guide
  1.3 Problem definition
  1.4 Method
  1.5 Scope
    1.5.1 Algorithms
    1.5.2 Numerical stability
    1.5.3 IEEE 754 and double-precision
    1.5.4 BLAS
2 Background
  2.1 Linear algebra
  2.2 GPU computing
3 Parallel platforms
  3.1 Cuda
    3.1.1 History
    3.1.2 Version
    3.1.3 Cuda program
    3.1.4 Architecture
    3.1.5 Limitations
  3.2 GPU.NET
    3.2.1 Overview
    3.2.2 Development
    3.2.3 Execution
    3.2.4 Limitations and bugs
    3.2.5 Evaluation
4 Hardware platform
  4.1 Analysis
  4.2 Benchmarking
    4.2.1 Memory performance
    4.2.2 Arithmetic performance
5 Implementation
  5.1 Development environment
  5.2 Design decisions
  5.3 Optimisation
    5.3.1 Strategy
6 Matrix-multiplication
  6.1 Analysis
    6.1.1 The sequential algorithm
    6.1.2 Parallelism
  6.2 Simple algorithm
    6.2.1 The algorithm
    6.2.2 Test and results
  6.3 Optimisation
    6.3.1 Unroll loop with threads
    6.3.2 Tiling v1
    6.3.3 Tiling v2 with latency hiding
    6.3.4 Tiling v3 with prefetching
    6.3.5 Tiling v4 and v5 with more output per thread
    6.3.6 Cuda compute capability
  6.4 Evaluation
7 LU decomposition
  7.1 Analysis
    7.1.1 The sequential algorithm
    7.1.2 Parallelism
  7.2 Simple algorithm
    7.2.1 The algorithm
    7.2.2 Test and results
  7.3 Block LU-decomposition
    7.3.1 The block algorithm
    7.3.2 Implementation
    7.3.3 Test and results
    7.3.4 Optimising round 1
    7.3.5 Test and results
    7.3.6 Optimising round 2
    7.3.7 Further optimisation
    7.3.8 Large matrices
  7.4 Evaluation
8 QR decomposition
  8.1 Analysis
    8.1.1 The sequential algorithm
    8.1.2 Parallelism
  8.2 Simple algorithm
    8.2.1 The algorithm
    8.2.2 Test and results
  8.3 Optimisation
    8.3.1 Test and results
  8.4 Block QR-decomposition
    8.4.1 The block algorithm
    8.4.2 Implementation
  8.5 Evaluation
9 Evaluation
  9.1 Cuda
  9.2 GPU.NET
10 Discussion and future work
  10.1 Project
  10.2 Cuda
  10.3 Hardware
  10.4 Future of GPGPU
11 Conclusion
Bibliography and references
Appendix A – Project evaluation
Appendix B – Implementation considerations
  Cuda thread organisation
  SIMT and warp size
  Elapsed time
  Pinned or page-locked memory
  Matrix structure
Appendix C – Hardware specification description and analysis
  Platform #1
  Platform #2
  Platform #3
  Platform evaluation
  Specifications
  Evaluation
Appendix D – Development environment problems and solution model
  Development model
  Cuda C and C++
Appendix E – CGMA and Cuda profiler
Appendix F – Matrix-multiplication CC levels
Appendix G – Report page count

List of figures
Figure 1 - Cuda program sequence diagram
Figure 2 - Cuda architecture with four multiprocessors
Figure 3 - How GPU.NET works as described on TidePowerd.com
Figure 4 – Simplified diagram of a chipset
Figure 5 - Matrix-multiplication process depicted
Figure 6 – The output of the console testing program
Figure 7 - Performance of kernels executed for different CC levels on platform #4
Figure 8 – Performance of simple LU-decomposition on different platforms.
Figure 9 – Matrix A being decomposed by block LU-decomposition in steps.
Figure 10 - Performance of block LU-decomposition v1 on different platforms.
Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4.
Figure 12 - Performance of block LU-decomposition v2 on different platforms.
Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4.
Figure 14 - Performance of block LU-decomposition v3 on different platforms.
Figure 15 – Showing the sub-matrix part of the triangular solve method.
Figure 16 – A 10.000 x 10.000 matrix LU-decomposed on platform #3 and #4.
Figure 17 - Peak performance of LU-decomposition v3 on platform #3
Figure 18 - Storage strategy for the compressed Householder QR-factorisation
Figure 19 - Matrix A being decomposed by block QR-decomposition in steps.
Figure 24 - Cuda thread organisation [4]
Figure 20 – Block diagram of a chipset. Source: Intel
Figure 21 - CPU and bus details of platform #1
Figure 22 - System memory details of platform #1
Figure 23 - GPU details of platform #1

List of tables
Table 1 - Hardware specifications for the four platforms
Table 2 - Measured bandwidth of Cuda memory transfer operations
Table 3 - Measured gigaflops performance of GPU
Table 4 - Test result of outer loops matrix-multiplication on platform #1
Table 5 - Test result of outer loops matrix-multiplication no structure on platform #1
Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1
Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1
Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms
Table 9 - Kernel invocation overhead ratio of total running time
Table 13 - Cuda built-in variables
Table 10 - GPU specifications for Nvidia GeForce 9400m, platform #1
Table 11 - GPU specifications for Nvidia GeForce 8800 GS, platform #2
Table 12 - GPU specification for Nvidia Tesla C1060, platform #3
Table 14 - Selected profile counter from Compute Visual Profiler User Guide

Summary
The purpose of this project was to uncover characteristics, features and limitations of the Cuda architecture. An optimisation strategy was formed, containing methods and techniques that were expected to increase performance. Three frequently used linear algebra algorithms, for matrix-multiplication and for LU- and QR-decomposition, were chosen. These algorithms were first implemented in a simple version, and performance and correctness tests were performed. GPU.NET was used as a frame of reference where applicable. The optimisation strategy was then used to improve the performance of the implemented algorithms. It was found that a linear block algorithm could achieve better performance than a regular algorithm.

The main output from this project was a list of recommendations and experiences from the tests performed on the linear algebra algorithms. The findings suggested that tiling was the best optimisation strategy, followed by latency hiding and coalescing memory access. In addition to these points, the following recommendations are based on the testing:

• Avoid using structures as parameters in kernel definitions; use simple types or pointers to them instead.
• Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the result will be more accurate.
• Unroll loops by making the threads fine-grained. Thread generation and scheduling are cheap.
• Thread block size should be a multiple of the warp size (currently 32).
• Be aware of the overhead of invoking a kernel.
• Note that the default instructions deviate from IEEE 754. Use the specific IEEE 754 functions for increased precision, but at the cost of speed.

Besides the list and suggestions above, there were also methods with doubtful results:

• The Volkov suggestion yielded performance gains on some systems, but losses on others. It can be useful for low occupancy kernels, but should be tested and evaluated.
• Data prefetching can both increase and lower performance.

It was pointed out that the underlying hardware and its capabilities played an important role in whether an optimisation technique affected performance. Some methods had a positive effect on some GPUs, and a negative effect on others. Analysis and testing should therefore always be performed. The purpose described in the problem definition was achieved, and the learning goals were reached satisfactorily.

1 Introduction
This 30 ECTS thesis project was produced by Mikkel Bundgaard-Ovesen from 1 February 2011 to 1 August 2011 at ITU Copenhagen. The project builds on the results from the report ”Documentation of the GPUs usability in advanced parallel calculations” [1], and was supervised by Peter Sestoft.

1.1 Motivation
The speed of computers has increased over the years as a result of increased demand for processing power. The CPU has, from the beginning, been the preferred architecture for computing. But during the last decade an additional computing architecture has evolved, namely the graphics processing unit (GPU). A GPU, also called a massively parallel processor, offers tremendous performance in gigaflops at a relatively low cost. Different parallel computing architectures, such as Nvidia’s Compute Unified Device Architecture (Cuda), Open Computing Language (OpenCL) and Microsoft’s DirectCompute, have been developed to serve as platforms for general purpose programming on the GPU (GPGPU) and to enable massively parallel programs.

Utilising the immense GPU power is not a trivial task. The execution model of the GPU is becoming more and more flexible, but being a SIMD model it puts restrictions on utilisation. Data-parallel algorithms that have a simple execution path and high arithmetic intensity are usually well suited for processing by the GPU architecture. But there are indications that other algorithms, which do not share these characteristics, can in fact be optimised in such a way that they are accelerated by the GPU.

The huge performance offered at a relatively low cost makes it interesting to find out how this power can be harnessed. In this project, I will look into how the linear algebra operation matrix-multiplication and the decompositions LU and QR can be implemented and optimised on the Nvidia Cuda architecture.

1.2 Reading guide
This report is addressed to readers with an interest in GPGPU. The report assumes that the reader has knowledge of C and development experience. Knowledge of linear algebra and the algorithms would be beneficial. References in the report are shown in the text as [number], and the reference can be found in the bibliography and references list.

The report is divided into 11 chapters. The first chapter describes the purpose and the goals of the project. The second chapter gives a short introduction to linear algebra and the history of GPU computing; readers can skip this chapter.

The third chapter describes the parallel platforms Cuda and GPU.NET. Sections 3.1.4 and 3.1.5 are the most important.

Chapter 4 focuses on the hardware platform and its influence on performance. The different development and test systems are described, as is the importance of the chipset’s North Bridge. Section 4.2.2 holds a description of the CGMA term, and an analysis is performed.

Chapter 5 describes the development environment together with some design decisions. The most important section is 5.3, which holds the optimisation strategy used later.

Chapter 6 analyses the matrix-multiplication algorithm, and describes its implementation and optimisations. The results of the improvements are found throughout the chapter, and an evaluation can be found in section 6.4.

Chapter 7 deals with LU-decomposition. The chapter describes the algorithm, its implementation and optimisation, along with the test results. An evaluation can be found in section 7.4.

Chapter 8 looks into QR-decomposition. The algorithm is described and analysed, after which it is implemented, improved and tested. Results are found throughout the chapter, and an evaluation can be found in section 8.5.

Chapter 9 summarises the results from all three algorithms, and compares them with the initial optimisation strategy. This chapter is important and serves as an evaluation and conclusion.

Chapter 10 looks at the work done so far, and discusses possible extensions to the project. A broader perspective is also discussed, looking into Cuda, hardware and GPGPU in general.

Chapter 11 is solely the conclusion to the problem definition. For an evaluation of the project’s results, please refer to chapter 9.

1.3 Problem definition
The purpose of this project has changed during the project period. Initially the focus was firstly to identify linear algebra algorithms suitable for implementation on the parallel Cuda architecture, and secondly, in the process of implementing these algorithms, to try to understand the Cuda architecture. In the end, my supervisor Peter Sestoft and I agreed to focus on three frequently used linear algebra algorithms, for matrix-multiplication and for LU- and QR-decomposition. By having these algorithms selected in advance, this project can focus on the core objective: to uncover characteristics and features of Cuda. The following statement and the elaborating points reflect this clarification:

Firstly, implement linear algebra algorithms for matrix-multiplication, LU-decomposition and QR-decomposition and evaluate their performance on the parallel GPU Cuda architecture. Secondly, analyse, test and describe different optimisation techniques relevant to the Cuda implementations, and furthermore describe how they may be used in general.

• Describe the linear algebra algorithms and their characteristics.
• Describe the Cuda architecture and development platform, uncovering characteristics, features, problems and a future outlook.
• Analyse and implement the linear algebra algorithms on the Cuda platform. Describe how an implementation can be performed, including any benefits as well as limitations.
• Analyse and document optimisation techniques for the algorithm implementations on Cuda.
• Perform correctness and performance tests.

Learning goals

• Knowledge of linear algebra and linear algebra algorithms
• Understanding of the Cuda architecture and platform
• Obtaining skills in C/C++ and Cuda C
• Ability to implement linear algebra algorithms using C/C++ and C for Cuda

1.4 Method
1. Study literature on linear algebra, C and C++ and Cuda architecture development.
2. Implement basic versions of linear algebra algorithms in C/C++ and C for Cuda using Visual Studio and Nvidia Nsight. Develop tests and benchmarks and compare results with comparable CPU implementations.
3. Implement optimisations for the algebra algorithms and compare results with CPU implementations.

As mentioned before, this thesis builds on the experiences and results of the project ”Documentation of the GPUs usability in advanced parallel calculations” [1]. One of the goals of that project was to uncover how the GPU could be utilised from .NET. This is not a specific goal for this thesis; however, I regard it as an important perspective. During the thesis research period, I discovered GPU.NET by TidePowerd, a framework and tool whose main feature is to bridge Cuda and .NET. In this project, GPU.NET will be used, where it makes sense, to compare algorithm implementations and their performance with the pure Cuda implementations. It will be interesting to see how GPU.NET performs compared to pure Cuda C/C++, and furthermore, whether GPU.NET is easier to use. The correctness of the algorithms in both GPU.NET and Cuda will be tested by comparing their results with results computed by the CPU.

1.5 Scope
Not every area of this project can be analysed and documented. Prioritising is important, so that the parts that are covered are treated in adequate depth.

1.5.1 Algorithms
This project is an empirical study that should document the implementation, optimisation and performance of existing linear algebra algorithms on the Cuda platform. It is not part of this project to develop new algorithms, but merely to base the testing on existing ones. The algorithms selected are designed for dense linear algebra, and are well known and well documented.

1.5.2 Numerical stability
A numerical stability analysis of the different algorithms is outside the scope of this project. The algorithms are well known, and are well documented in terms of numerical stability. That said, the applicability of an algorithm implementation obviously depends on it delivering a correct result. All algorithms are implemented for both the GPU and the CPU, and tests are performed on both platforms to compare the results. The maximum difference between the results indicates how precisely the GPU implementation performs compared to the CPU implementation.
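
As a minimal sketch of the correctness measure described above, the maximum difference between a CPU result and a GPU result could be computed as follows. It assumes that both results are stored as flat arrays of single-precision floats of equal length; the function name max_abs_diff and its parameters are hypothetical and are not taken from the project code.

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest absolute element-wise
 * difference between a reference result (for example from the CPU)
 * and a candidate result (for example copied back from the GPU).
 * Both arrays are assumed to hold n single-precision values. */
float max_abs_diff(const float *reference, const float *candidate, size_t n)
{
    float max_diff = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float diff = reference[i] - candidate[i];
        if (diff < 0.0f) {
            diff = -diff;              /* absolute value without math.h */
        }
        if (diff > max_diff) {
            max_diff = diff;
        }
    }
    return max_diff;
}
```

For an n x n matrix stored as one contiguous array, the function would be called with n * n as the element count. A value of zero means the two results are identical, while a larger value indicates a larger deviation between the two implementations.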

1.5.3 IEEE 754 and double-precision
Devices with Cuda support double-precision floating point operations from Compute Capability (CC) version 1.3 and higher. Support for double-precision operations is not a common denominator for the development and test computers. Furthermore, double-precision operations impact performance significantly, which is why I chose to implement the algorithms using single-precision floating point operations. The IEEE 754 standard for floating point arithmetic is supported and followed by Cuda, but there are documented deviations from the standard [2]. For instance, the FMAD (multiply-add) instruction loses precision by combining two operations into one instruction, and there are other deviations. These deviations will influence the correctness test, but because performance is prioritised higher than exact precision, I will not take any specific precautions, as they would impact performance. The maximum differences from the correctness test will give an indication of how these deviations from the IEEE 754 standard may affect precision.

1.5.4 BLAS
Basic Linear Algebra Subprograms (BLAS) is an interface for linear algebra operations, and it offers optimised operations for vectors and matrices. Many linear algebra algorithms are designed on the basis of these operations, but the implementations in this project will not use any BLAS API, even though Cublas would be the obvious choice. To really uncover the architecture’s capabilities it is necessary to experiment with it directly. For that reason I implement all algorithms without the use of such math libraries. This means that the full performance potential of the algorithms will not be achieved, but it will give better insight.

2 Background
This chapter serves as an introduction to the ideas that will be used throughout the report. First, linear algebra is briefly described, followed by the concept of parallel computing in relation to the GPU.

2.1 Linear algebra
Hermann Grassmann is known as the inventor of linear algebra [3]. He did not invent and describe the entire field, but recognised linear algebra as a formal theory. In his two releases of “Ausdehnungslehre”, he describes some important ideas that helped define the basis of linear algebra as it is known today. Linear algebra is a term that covers several different topics and binds them together. Some of these topics are systems of linear equations, linear transformations, vectors, matrices, determinants and vector spaces. Frequently used in linear algebra are matrix-multiplication and LU- and QR-decomposition, each of which serves a purpose in solving either a system of linear equations or a linear least squares problem. These problems are in turn often encountered in the fields of research, engineering, physics, economics and statistics.

2.2 GPU computing
The performance and capabilities of graphics processors have gone through an incredible development from the beginning up until today. From the command line based user interfaces of the 1980s, through the more graphically driven interfaces of the 1990s and all the way until today, graphics processing power has increased and has evolved to support 2D, 3D, OpenGL, DirectX and more [4]. The release and popularity of 3D computer games further accelerated the demand for, and development of, more and more powerful graphics processors. On 31 August 1999 Nvidia released the GeForce 256, the world’s first GPU [5], which could perform graphical computations directly on the graphics processor. ATI, the main competitor of Nvidia, soon followed by releasing their Radeon R100 chips with the same capabilities. But it was not until 2001 that the major breakthrough in relation to GPU computing came. Nvidia released the GeForce 3, the first chip to support Microsoft’s DirectX 8 standard, which required the chip to support programmable vertex and pixel shaders. ATI followed with the release of their Radeon R300 chip in 2002.

Programmable vertex and pixel shaders were intended solely for graphics rendering. However, they were actually small programs that performed a programmed computation on some input and then returned the output. The computational power of the GPU, combined with the programmable vertex and pixel shaders, made developers look into how the GPU could solve problems other than just graphics rendering.

3 Parallel platforms
This chapter takes a deeper look at Cuda and GPU.NET, and describes their usage, features and performance-limiting factors. GPU.NET uses Cuda, so most of the effort will go into describing and analysing Cuda.

3.1 Cuda
Cuda stands for Compute Unified Device Architecture and is a generic term covering the GPU architecture of Nvidia’s graphics cards, the development platform and the tools. It can be described as a parallel computing architecture and development platform that enables the GPU to solve general-purpose computational problems [4].

3.1.1 History
The first programmable GPU was released in 2001, and in the early stages the only way to access the GPU was through a graphics API, such as OpenGL or DirectX. This meant that general use of the GPU was difficult. Nvidia saw the potential of the GPU as another computing platform, and initiated the development of a completely new architecture. This architecture was to overcome the limitations of earlier GPUs by allowing General-Purpose computation on Graphics Processing Units (GPGPU) without the need to use a graphics API. In 2006 Nvidia released the GeForce 8800 GTX, the first GPU to support the Cuda architecture. Later, in June 2007, the first version of the Cuda development toolkit was released. Over the years the architecture and toolkit have undergone development and improvements, with the latest toolkit released in May 2011.

3.1.2 Version
The latest version of the Cuda toolkit when this project was initiated was version 3.2, released in November 2010. Many things have happened since, and the current toolkit version, as of May 2011, is version 4.0. Some of the new features include “Share GPUs across multiple threads”, “Use all GPUs in the system concurrently from a single host thread” and “No-copy pinning of system memory, a faster alternative to cudaMallocHost()”. Even though the last feature is interesting, none of these newly added features brings any major benefit to this project, so an upgrade during the project phase was deemed unnecessary. Hence, version 3.2 is used throughout this project and report.


3.1.3 Cuda program
A Cuda program is a hybrid between code processed by the CPU and code processed in parallel by the GPU. The CPU is referred to as the host, and the CPU code is called host code. The GPU is referred to as the device, and its code is, unsurprisingly, called device code. A typical example of a Cuda program is shown in the following. The host code is written in standard C or C++ as shown here:
1.  // Declare device pointers
2.  int *d_base, *d_n, *d_out;
3.  int blocks = (N+THREADS_PER_BLOCK-1)/THREADS_PER_BLOCK;
4.
5.  // Allocate memory on device
6.  cudaMalloc( (void**)&d_base, N * sizeof(int) );
7.  cudaMalloc( (void**)&d_n, N * sizeof(int) );
8.  cudaMalloc( (void**)&d_out, N * sizeof(int) );
9.
10. // Copy data from host -> device
11. cudaMemcpy( d_base, base, N * sizeof(int), cudaMemcpyHostToDevice);
12. cudaMemcpy( d_n, n, N * sizeof(int), cudaMemcpyHostToDevice);
13.
14. // Execute kernel
15. power<<<blocks,THREADS_PER_BLOCK>>>(d_base, d_n, d_out, N);
16.
17. // Copy data from device -> host
18. cudaMemcpy( out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
19.
20. // Free memory on device
21. cudaFree( d_base );
22. cudaFree( d_n );
23. cudaFree( d_out );
24.
25. // Let the Cuda runtime know that we are finished
26. cudaThreadExit();


The device code is structured in a function called a kernel, as shown here:
// Device kernel called power
__global__ void power( int *base, int *n, int *output, int N ) {

    // The unique thread id
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Guard, only if thread has actual data to process
    if (tid < N) {

        // Initialise registers with values from the input arrays
        // (locals renamed so they do not shadow the pointer parameters)
        int p = 1, b = base[tid], num = n[tid];

        // Compute p = b^num
        for (int i = 1; i <= num; ++i)
            p = p * b;

        // Write result to output array
        output[tid] = p;
    }
}

The host sends commands and messages to the device by invoking functions. A sequence diagram, based on the program illustrated above, is shown in Figure 1.

Figure 1 - Cuda program sequence diagram

Lines 6 to 8 in the host code allocate memory on the device, lines 11 and 12 copy data to the device, and the kernel is invoked in line 15. When the kernel is done processing, the host copies the data back from the device memory in line 18, after which the memory is released again in lines 21-23.
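The block count in line 3 of the host code is a rounded-up integer division, and the guard in the kernel makes the surplus threads of the last block do nothing. The following plain C++ sketch emulates this indexing on the CPU. The names `blocksFor` and `emulatePower` are hypothetical and not part of Cuda, and the two nested loops only imitate sequentially what the device would do in parallel.

```cpp
#include <cassert>
#include <vector>

// Rounded-up division: the smallest number of blocks that covers n elements.
int blocksFor(int n, int threadsPerBlock) {
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of the power kernel. The two outer loops stand in for
// blockIdx.x and threadIdx.x, which the GPU would process in parallel.
std::vector<int> emulatePower(const std::vector<int>& base,
                              const std::vector<int>& n,
                              int threadsPerBlock) {
    assert(base.size() == n.size());
    const int N = static_cast<int>(base.size());
    std::vector<int> out(N, 0);
    const int blocks = blocksFor(N, threadsPerBlock);
    for (int blockIdx = 0; blockIdx < blocks; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < threadsPerBlock; ++threadIdx) {
            int tid = threadIdx + blockIdx * threadsPerBlock;
            if (tid < N) {  // guard: surplus threads do nothing
                int p = 1;
                for (int i = 1; i <= n[tid]; ++i)
                    p *= base[tid];
                out[tid] = p;
            }
        }
    }
    return out;
}
```

With three elements and two threads per block, two blocks are launched and the fourth thread is filtered out by the guard, so every element is still written exactly once.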

3.1.4 Architecture
Cuda is an architecture consisting of both the physical layout of the GPU and the logical structure of threads in the Cuda runtime. The exact physical layout and specifications differ between chip versions¹, and the capabilities of these chips are defined by the Compute Capability (CC) version. The first chips were released with CC 1.0, and the latest version is 2.1.

Physical layout
The GPU shown in Figure 2 is a simplified G86 GPU with CC version 1.1. It consists of four streaming multiprocessors (SM), and each SM contains 8 streaming processors (SP) and two special function units (SFU). In an SM, the SPs process normal instructions like add and multiply, and in this case the 8 SPs are able to process 8 normal instructions per clock cycle. The SFUs process instructions related to square root, sine and cosine, logarithms and exponentials, so a kernel with heavy usage of these instructions will be limited to only two instructions per clock cycle.

An SM has, besides the SPs and SFUs, access to different memory types. The registers and the shared memory are limited in size but very fast, and they are local to each SM. Registers are the fastest memory type and are local to a thread; an SM with CC 1.1 has 8192 (8K) 32-bit registers. The shared memory is a bit slower, but there is more of it and it is shared between the threads in a block. The shared memory size of an SM with CC 1.1 is 16KB. All SMs of a GPU have read access to the constant memory, which for all current CC versions is 64KB. Access to the constant memory is cached and generally faster than access to the global memory, which is the device’s main memory. All SMs have shared access to the global memory, and its exact size and speed are device dependent.

¹ For instance, G80 allows 768 threads per multiprocessor, whereas GT200 allows 1536.

Figure 2 - Cuda architecture with four multiprocessors

Memory
As described above, there are different memory types, but the common denominator is that they are all typically based on dynamic random access memory (DRAM). Accessing a single bit in a DRAM cell is a slow process, and to improve performance DRAM controllers read several consecutive bits in parallel [4]. This means that truly random access to DRAM yields low performance. To achieve the highest possible memory performance, a kernel should therefore access consecutive memory locations as much as possible. This is called coalesced memory access.

Accessing memory in a coalesced manner is important for all memory types. This also holds for shared memory, even though it is on-chip and fast, and access to shared memory should in addition minimise bank conflicts. Shared memory on CC version 1.1 has 16 banks, and the bandwidth of each bank is 32 bits per two clock cycles. If two or more threads access the same bank, the accesses are handled sequentially, which impacts performance. From CC version 2.0 and higher, simultaneous access to the same bank has been optimised: multiple reads from a single bank result in only a single read instruction being performed, after which the value is broadcast to all threads.

Cuda threads
Cuda threads are very different from CPU threads; the only similarity is that they process data in parallel. The GPU can be classified as SIMD, which makes its applicability differ from that of a CPU. A task, or a set of instructions, can be performed on different data in parallel, and SIMD means that two independent tasks cannot be performed in parallel by the GPU. Furthermore, threads in Cuda are very lightweight compared to threads on the CPU. A typical Cuda program uses several thousand Cuda threads, and thread generation and scheduling should therefore not be considered a limitation.

Cuda threads are organised into blocks, and the threads in a block can share memory and be synchronised. The blocks are in turn organised into a grid. This logical thread structure allows threads to be organised in several dimensions, so that the structure of the threads can correlate directly with specific data structures; a matrix, for instance, is defined in two dimensions. For more details about how threads are organised, and how this affects usage in a kernel, please refer to appendix B.

Thread scheduling
The presumption is that the threads of a block are grouped and processed in parallel. This is conceptually true, but it is not what actually happens. The current implementation of thread scheduling in e.g. G80 and GT200 chips schedules threads in units called warps. A warp is a bundle of 32 threads executed in parallel, and a block with, for instance, 128 threads is partitioned into 4 warps. The threads of a warp execute the same instruction, hence Cuda is a SIMD architecture. This is a design decision to reduce hardware cost and to enable optimisation techniques, and it is not without relevance to the developer. The size of a warp has a direct impact on the recommended size of blocks.
Consider the example where a problem is organised into 20 blocks each with 10 threads, giving a total of 20 x 10 = 200 threads. Cuda executes 32 threads in a warp in parallel. In the example above, only 10 threads are available per block. Cuda will in this case fill up the warp with 22 empty threads, resulting in 20 x 22 = 440 empty threads being created. It is advisable to set the block size to a multiple of the warp size, currently 32 [4].
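The arithmetic of this example can be written down as a small C++ sketch. The function names are hypothetical; the only inputs taken from the text are the warp size of 32 and the rule that every block is padded to a whole number of warps.

```cpp
#include <cassert>

const int kWarpSize = 32;  // warp size as stated in the text

// Number of warps needed to hold the threads of one block.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Number of empty (padding) threads created for the whole grid.
int emptyThreads(int blocks, int threadsPerBlock) {
    assert(blocks >= 0);
    int padded = warpsPerBlock(threadsPerBlock) * kWarpSize;
    return blocks * (padded - threadsPerBlock);
}
```

For 20 blocks of 10 threads, `emptyThreads(20, 10)` gives the 440 empty threads from the example, whereas a block size of 32 gives none.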


Occupancy and latency
An SM with CC version 1.1 is able to handle 768 resident threads, and as the current warp size is 32, the maximum number of resident warps per SM is 24. The actual number depends on the kernel’s consumption of registers and shared memory. The Cuda occupancy is, for a given kernel, the ratio of active warps to the maximum number of warps supported by the SM. In other words, the occupancy indicates how many active warps and threads an SM can hold.

The number of clock cycles it takes for a warp to be ready to execute its next instruction is called latency. Some instructions incur latency, for instance global memory access, which incurs high latency before the data is supplied. The execution of a warp does not halt due to a memory access; execution continues until the data is actually needed, and if the data is still unavailable the scheduler switches to another warp. Whenever a warp incurs latency, the SM should switch to another warp and start processing it to achieve full utilisation of the SM. So the Cuda occupancy ratio can indicate whether the performance of a kernel suffers from high latency. Vasily Volkov [6][7] has, however, shown that high performance does not necessarily require high occupancy, so improvements based solely on the occupancy ratio should be carefully evaluated. Optimisation that aims at full utilisation of the SM is called latency hiding.
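As a rough illustration of how register and shared memory consumption limit occupancy, the following C++ sketch estimates the ratio from the CC 1.1 figures given above (768 threads, 8192 registers and 16KB shared memory per SM). It is a simplified model and not Nvidia's occupancy calculator: the struct and function names are hypothetical, and the sketch deliberately ignores hardware allocation granularity and any limit on the number of blocks per SM.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits for CC 1.1 as stated in the text.
struct SmLimits {
    int maxThreads = 768;
    int registers = 8192;
    int sharedBytes = 16 * 1024;
    int warpSize = 32;
};

// Simplified occupancy estimate: active warps / maximum warps.
// Assumption: only whole blocks are resident; allocation granularity
// and the per-SM block limit are ignored.
double estimateOccupancy(int threadsPerBlock, int regsPerThread,
                         int sharedBytesPerBlock, SmLimits sm = SmLimits()) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs = sm.registers / (regsPerThread * threadsPerBlock);
    int blocks = std::min(byThreads, byRegs);
    if (sharedBytesPerBlock > 0)
        blocks = std::min(blocks, sm.sharedBytes / sharedBytesPerBlock);
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int maxWarps = sm.maxThreads / sm.warpSize;  // 24 for CC 1.1
    return static_cast<double>(blocks * warpsPerBlock) / maxWarps;
}
```

In this model a kernel with 256 threads per block and 10 registers per thread reaches full occupancy, while raising the register count to 16 per thread leaves room for only two blocks and drops the estimate to 16 of 24 warps.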

3.1.5 Limitations
Using Cuda can be advantageous, but there are also limitations that influence the implementation and optimisation of algorithms. In the following I describe a couple that are relevant to this project.

One should be aware that the Cuda architecture was developed for speed at the expense of precision. An algorithm implemented on the GPU can, for that reason, be numerically less stable than the same algorithm implemented on the CPU. For example, the operations multiply and add can be contracted into a single FMAD (multiply-add) instruction, which specifically deviates from the IEEE 754 standard. FMAD instructions are, for instance, often used in linear algebra algorithms to calculate dot-products, vector norms and more. Nvidia has been focusing on this, and later CC versions should comply better with the IEEE 754 standard.

Latency, warps and memory access, described in the architecture chapter (3.1.4), are all factors that influence computational performance. One should therefore not expect to get close to the theoretical performance of a device, as these factors will limit the performance of an algorithm. The theoretical properties can, on the other hand, be of assistance in the analysis and optimisation of a kernel.
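That contracting a multiply and an add into one instruction can change a result is easy to demonstrate on the CPU. The C++ sketch below compares two separately rounded operations with the standard library's `std::fma`. This is only a stand-in: `std::fma` rounds once at the end, and the exact rounding behaviour of the FMAD instruction is not described in this report, so the sketch shows that contraction matters, not what a particular GPU returns. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Multiply and add as two separately rounded float operations.
// The volatile store forces the product to be rounded to float first,
// so the compiler cannot contract the expression on its own.
float mulThenAdd(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

// Contracted version: a * b + c with a single rounding at the end.
float fusedMulAdd(float a, float b, float c) {
    return std::fma(a, b, c);
}
```

For a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding it to single precision discards the last term, so the two-step version returns 0, while the contracted version keeps the term and returns 2^-24.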


Cuda C is an extended version of ANSI C, and is the language in which the device code, or kernels, are written. A kernel function is able to call other device functions, but recursion is currently not supported. Cuda is developed by Nvidia, and can only be used on Nvidia GPUs.

3.2 GPU.NET
The framework GPU.NET consists of a runtime and a compiler, which are integrated with Visual Studio. This framework makes it possible to develop host and kernel code directly in .NET with all the benefits that .NET and Visual Studio provide. The version used for this project is GPU.NET v1.0.3.5.

3.2.1 Overview
GPU.NET currently only supports Cuda, but is expected to support other parallel architectures in the future. GPU.NET allows a developer to write host and device code directly in .NET using the API from the provided assembly, and thereby to perform computations on hardware-accelerated architectures. Accelerating .NET code is achieved in two steps: first the .NET code is written, decorated and compiled, then the GPU.NET runtime accelerates the program during execution, as shown in Figure 3.

Figure 3 - How GPU.NET works as described on TidePowerd.com


3.2.2 Development
Visual Studio 2010 is supported for development as well as .NET 4. The kernel is annotated with a KernelAttribute that also holds the name of the CPU method to be used if no acceleration hardware is present. ThreadIndex and BlockIndex hold the same values as when used in Cuda directly.
[Kernel(CustomFallbackMethod = "MatrixMultiplication_CPU")]
private static void MatrixMultiplicationSimpleNS_GPU(float[] a, float[] b, float[] c, int aheight, int awidth, int bwidth)
{
    // Thread ID
    int tid = ThreadIndex.X + BlockIndex.X * BlockDimension.X;

    if (tid < aheight)
    {
        for (int j = 0; j < bwidth; ++j)
        {
            float sum = 0;

            for (int k = 0; k < awidth; ++k)
            {
                float av = a[tid * awidth + k];
                float bv = b[k * bwidth + j];
                sum += av * bv;
            }
            c[tid * bwidth + j] = sum;
        }
    }
}

The kernel returns void and is private, and can therefore not be called directly from outside the class. Another public and static method is created, which calls the kernel as shown in line 5.
1. public static float[] MatrixMultiplicationSimpleNS(float[] a, float[] b, int aheight, int awidth, int bwidth)
2. {
3.     var c = new float[aheight * bwidth];
4.
5.     MatrixMultiplicationSimpleNS_GPU(a, b, c, aheight, awidth, bwidth);
6.
7.     return c;
8. }

The .NET code is compiled to a normal assembly, in which the GPU.NET compiler then injects calls to the GPU.NET runtime. The result is a modified .NET assembly where calls to any kernel method are being redirected to the GPU.NET runtime, and hence the GPU.


3.2.3 Execution
When the program is being executed, the GPU.NET runtime detects the availability of any supported hardware. When a call to a kernel is detected, the kernel code is then passed to the correct vendor plug-in, which in turn JIT compiles the code to the hardware vendor’s instruction set architecture. Lastly the runtime executes the compiled device code and transfers any data back to the .NET runtime. If no hardware acceleration is present, then the CPU version of the kernel is called.

3.2.4 Limitations and bugs
GPU.NET is a relatively young framework that contains obvious bugs and limitations. Version 1.0.3.5 contains the following bugs and limitations:
• There is currently only support for Nvidia Cuda v3.0 and newer, but support for AMD devices is under development.
• Local variables and parameters can only be of primitive types. Parameters can, in addition, be arrays of a primitive type.
• Shared memory can only hold fields that are primitive types or single-dimensional arrays of primitive types.
• Kernels must be static and return void, and cannot be recursive or call any other methods.

I have experienced problems with casting variables in the kernel. Casting variables from double to float, or even float to float, resulted in compile errors. My conclusion is that casting is not supported and should be avoided, so it is necessary to design an algorithm exclusively for either single or double precision.

The shared memory size of a kernel must be specified at compile time, whereas in Cuda C it can be specified dynamically at runtime. Implementing kernels optimised for different data sizes is therefore difficult, which has led me to set the size of the allocated shared memory high. This makes sure that kernels will run with different data sizes, but it is not optimal: shared memory is a scarce resource, and allocating much of it lowers occupancy.

Occasionally, when a GPU.NET application was executed for the first time, an exception was thrown. Subsequent executions were processed with no problems. Furthermore, due to a thread bug a GPU.NET program does not exit by itself. I have solved this by terminating the process by calling:
System.Environment.Exit(0);


The last two bugs are expected to be fixed in newer releases.

3.2.5 Evaluation
GPU.NET can definitely be used for testing and playing with GPU acceleration of programs. But one should, with the current version, expect bugs and minor problems; the framework is far from mature at this point. Furthermore, by using JIT compilation GPU.NET incurs a performance hit compared to Cuda, as Cuda kernels are already compiled when the program runs. This is a design decision made by TidePowerd, a decision that makes the framework flexible at the expense of performance. GPU.NET does, however, cache the JIT-compiled kernel in memory, so subsequent calls to the same kernel will not incur the same performance hit.


4 Hardware platform
The parallel architecture software has been described above, but there are also hardware specifications that are important to the performance of Cuda. The following chapter analyses hardware specifications and performs two simple tests. Cuda requires Nvidia GPUs, so all development and test computers must be equipped with an Nvidia GPU. I used two development and two test computers for this project. All machines run Windows 7 and have Cuda v3.2 installed, including the matching drivers. The following table shows selected hardware specifications for the platforms.

| System | #1 | #2 | #3 | #4 |
|---|---|---|---|---|
| Type | Development | Development | Testing | Testing |
| Graphics bus | 8.00 GB/s (PCI-E v2.0) | 4.00 GB/s (PCI-E v1.1) | 8.00 GB/s (PCI-E v2.0) | 4.00 GB/s (PCI-E v1.1) |
| Host memory | 16.60 GB/s (DDR3) | 6.23 GB/s (DDR2) | 16.60 GB/s (DDR3) | 6.25 GB/s (DDR2) |
| GPU | GeForce 9400m (G86) | GeForce 8800 GS (G92) | Tesla C1060 (GT200) | GeForce GT440 (GF108) |
| Cores | 16 | 96 | 240 | 96 |
| Shader clock | 1100 MHz | 1250 MHz | 1300 MHz | 1645 MHz |
| Device memory | 8.00 GB/s² (DDR3) | 37.45 GB/s (GDDR3) | 102.40 GB/s (GDDR3) | 51.20 GB/s (GDDR5) |
| Processing power in gigaflops (MUL+ADD+SF) | 51.56 | 351.56 | 914.06 | 462.66 |
| Processing power in gigaflops (MUL+ADD) | 34.38 | 234.38 | 609.38 | 308.44 |
| Compute Capability | 1.1 | 1.1 | 1.3 | 2.1 |

Table 1 - Hardware specifications for the four platforms

² The actual memory speed is 16.60 GB/s, but system #1 has no dedicated device memory and uses host memory. So the device is limited by the speed of the graphics bus.

Two processing powers are stated for the GPU of each system, both of which are theoretical peak performances in gigaflops. The first is based on the GPU architecture design, which says that a GPU is capable of performing a multiply-add instruction dual-issued with a special function instruction per operation cycle, i.e. three floating-point operations per core per cycle. The second is a more realistic estimate, in which an operation cycle performs only a multiply-add instruction, i.e. two floating-point operations per core per cycle. For a detailed description of the different platforms, please refer to appendix C.
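The stated processing powers can be reproduced from the number of cores and the shader clock. The C++ sketch below does so. Note that the division by 1024 rather than 1000 is an assumption inferred from the values in Table 1 and is not stated in the text, and the function name is hypothetical.

```cpp
#include <cassert>

// Theoretical peak in gigaflops as it appears to be computed for Table 1.
// Assumption: the division by 1024 (instead of 1000) is inferred from the
// table values and is not stated in the text.
double peakGigaflops(int cores, double shaderClockMHz, int flopsPerCycle) {
    assert(cores > 0 && shaderClockMHz > 0.0 && flopsPerCycle > 0);
    return cores * shaderClockMHz * flopsPerCycle / 1024.0;
}
```

For system #3 (240 cores at 1300 MHz) this gives about 914.06 gigaflops with three operations per cycle and about 609.38 gigaflops with two, which matches the table.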

4.1 Analysis
This section digs a little deeper into the important hardware specifications and describes the theoretical performance limits. When dealing with GPUs, the important factors are memory transfer rates and GPU processing power. Figure 4 shows a simplified chipset diagram, which highlights the important elements, namely the processor, the DDR RAM, the GPU and the chipset’s north bridge.

Figure 4 – Simplified diagram of a chipset


In Table 1, the graphics bus indicates the maximum transfer rate between the north bridge and the GPU device. Host memory is the peak bandwidth between the DDR RAM and the north bridge. A chain is no stronger than its weakest link, and the same holds for data transfer between the host and the device, in either direction. Consider platform #1: the bandwidth of the host memory is 16.60 GB/s, but the graphics bus is limited to 8.00 GB/s, which is then the theoretical peak transfer rate between the GPU device and the host system.

4.2 Benchmarking
The specifications are theoretical, and to give a more realistic performance target I have tested actual data transfer and processing power.

4.2.1 Memory performance
All systems have been tested with three types of memory transfers: from host to device, from device to host, and from device to device. To improve the performance of memory transfers, host memory can be defined as page-locked (pinned) and write-combined. Pinned and write-combined memory is a scarce resource of the operating system. A Cuda program should therefore consume this operating system resource with caution, as it could impact overall system performance. The bandwidth measurements shown in the following table are performed both with regularly paged memory and with pinned and write-combined memory (P+WC). For more details please refer to appendix B.

| System | #1 | #2 | #3 | #4 |
|---|---|---|---|---|
| Host -> Device | 1,584.5 MB/s | 1,434.7 MB/s | 4,233.7 MB/s | 1,578.6 MB/s |
| Host -> Device (P+WC) | 5,224.9 MB/s | 2,513.1 MB/s | 5,761.9 MB/s | 2,509.8 MB/s |
| Device -> Host | 1,365.9 MB/s | 1,178.0 MB/s | 3,864.3 MB/s | 1,235.1 MB/s |
| Device -> Host (P+WC) | 5,096.2 MB/s | 1,687.9 MB/s | 5,297.6 MB/s | 1,857.9 MB/s |
| Device -> Device | 6,935.4 MB/s³ | 28,525.7 MB/s | 73,463.8 MB/s | 21,338.1 MB/s |
| Device -> Device (P+WC) | 6,951.3 MB/s³ | 28,529.0 MB/s | 73,527.3 MB/s | 21,339.2 MB/s |

Table 2 - Measured bandwidth of Cuda memory transfer operations

³ System #1 does not have any dedicated device memory, so actually the rates from host to device are relevant when a kernel needs to access “device” memory.

The measured transfer rates between host and device are, as expected, faster when using pinned and write-combined memory. For paged memory, the measured bandwidth is between 17% and 40% of the theoretical bandwidth; with pinned and write-combined memory this increases to between 47% and 72%. The results also show that, in general, device to host transfers are slower than their host to device counterparts.

The memory speed for copying data from host to device and vice versa is mostly important for hybrid algorithms, meaning algorithms that solve a problem by using both the CPU and the GPU. An implementation that requires transfers between the host and the device should be designed with caution, as these transfers restrict performance. Based on this, the strategy for the algorithm implementations of this project is to keep the data processing solely on the GPU, and to limit the number of data transfers between host and device. Consider the worst-case memory transfer scenario. System #3 has 4GB of device memory, and if this GPU was installed in the slowest system, it would be possible to copy all 4GB in about 3.57 seconds.

Global memory access is limited by device memory bandwidth, so the device to device memory transfer rate is interesting and relevant to the performance of a kernel. The results are between 41% and 75% of the theoretical limit. Whether the memory is paged or pinned and write-combined has no impact here, as this is an operating system resource and hence only relevant for host memory.

The result of the device to device memory transfer for system #1 is a bit misleading. System #1 does not have any dedicated device memory, so its device to device transfer rate only indicates the peak performance of the DDR3 RAM on the host system. Before the data could actually be processed by the GPU, it would have to pass the north bridge and the graphics bus, which is the same path as for host to device memory transfers.

4.2.2 Arithmetic performance
To get a realistic performance target in gigaflops for the GPU, I have created two kernels, each consisting of the three operations MUL+ADD+SF. The first kernel is normal in the sense that it reads the data to process from global memory, and by doing so it is limited by global memory performance. But as a normal kernel would make computations based on data from global memory, the peak performance of this kernel can be considered a normal-case scenario and helps define performance expectations. The second kernel does not access global memory, but consists of just the three MUL+ADD+SF operations. The peak performance in gigaflops of this kernel should be closer to the theoretical maximum arithmetic performance.


| System | #1 | #2 | #3 | #4 |
|---|---|---|---|---|
| MUL+ADD+SF + global read (gigaflops) | 2.19 | 9.70 | 27.87 | 8.18 |
| MUL+ADD+SF (gigaflops) | 15.41 | 60.09 | 62.29 | 29.08 |

Table 3 - Measured gigaflops performance of GPU

In reality, these kernels do not say anything about the maximum expected performance. An implementation of an algorithm can be optimised in several ways, and so could these kernels. However, they do suggest that global memory access is indeed a limiting factor. This factor is referred to as Compute to Global Memory Access (CGMA). The first kernel has one memory load from the input and one write to the output. The number of operations is three (multiply, add and power). Based on these numbers, the calculated CGMA is 1.5.

Consider system #3, where the device memory peak performance is 73,463.8 MB/s. The kernel uses single-precision floating-point values of 4 bytes each, so the system is able to transfer about 18,365.95 million single-precision values per second. The CGMA is 1.5; hence the peak performance of this kernel is about 27 gigaflops, which the result from Table 3 also shows. The memory transfer rates, together with an estimated CGMA, can be an important tool when analysing a kernel for optimisation.

With these results in mind, how do they compare to the performance of a CPU? Consider that a Pentium 4 3.06 GHz CPU computes a dot-product of single-precision floating-point values at between 1.8 gigaflops (single thread) and 3.08 gigaflops (multiple threads) [8]. In that light, even though the results from Table 3 are far lower than the theoretical processing power, it is evident that even un-optimised kernels can have a similar or even higher peak performance when compared to the CPU.
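The estimate for system #3 can be reproduced with a few lines of C++. The function name and parameter names below are hypothetical; the inputs are the measured device to device bandwidth from Table 2, the 4-byte value size and the CGMA of 1.5.

```cpp
#include <cassert>

// Upper bound in gigaflops for a kernel limited by global memory.
//   bandwidthMBs  : device memory bandwidth in MB/s
//   bytesPerValue : size of one value (4 for single precision)
//   cgma          : floating-point operations per global memory access
double memoryBoundGigaflops(double bandwidthMBs, int bytesPerValue, double cgma) {
    assert(bandwidthMBs > 0.0 && bytesPerValue > 0 && cgma > 0.0);
    double megaValuesPerSecond = bandwidthMBs / bytesPerValue;
    return megaValuesPerSecond * cgma / 1000.0;  // megaflops -> gigaflops
}
```

`memoryBoundGigaflops(73463.8, 4, 1.5)` evaluates to roughly 27.5 gigaflops, in line with the measured 27.87 gigaflops for the first kernel on system #3.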


5 Implementation
In the following chapter I describe the development environment and some design decisions, but more importantly, I form an optimisation strategy used for the algorithms.

5.1 Development environment
The computers used in this project are based on Windows 7, and the Cuda toolkit version 3.2 was the latest release when this project was initiated. Cuda v3.2 natively supports Visual Studio 2008 (VS2008). It is possible to enable development in Visual Studio 2010, but this has proven difficult to set up. Making VS2008 ready for Cuda development is not a trivial task either. The compilation process includes two compilers, the Nvidia Cuda Compiler (NVCC) and Microsoft’s Visual C++ compiler (VCC). Configuring VS2008 properly and making the NVCC, VCC and linker play together remained a challenge. Please refer to appendix D for a detailed description of the problems involved, and for a development model solution.

5.2 Design decisions
The general rule is that the parts of the algorithm that exhibit little or no data parallelism should be processed by the host, while the parts that exhibit a rich amount of data parallelism should be processed by the device. Sometimes it is beneficial to process code on the device even though it cannot exploit the parallel architecture. The decisive factors in these situations are the size of the data and the time needed to transfer data between the host and the device. The strategy I will follow is to limit the transfer of data between host and device to a minimum [9], by reducing these transfers to an initial and a final one, like this:
1. Copy data to device
2. Process data on device
3. Copy data to host
This means that these data transfers are not part of the actual algorithm, and when measuring peak processing power in gigaflops, I only measure the core algorithm execution time. That is, the initial configuration and data transfer, as well as the releasing of resources and the retrieval of the output, are not measured. By exclusively measuring the core algorithm, it is possible to directly compare the peak processing power of the GPU with that of the CPU.


As mentioned earlier, support for double-precision operations is not a common denominator for the development and test machines. Algorithms are therefore implemented using single-precision floating point, which is supported by all GPU devices, the CPU and GPU.NET.

5.3 Optimisation
The aim is to implement and optimise three linear algebra algorithms for the Cuda architecture. The method for doing so is composed of the following steps:
1. Use an existing, well-known and well-documented algorithm for implementation in C/C++ for CPU processing.
2. Analyse and update the CPU implementation to Cuda C, while making simple improvements that exploit the parallel architecture.
3. Test the implementations.
4. Optimise based on the test results, and test again.

5.3.1 Strategy
The Nvidia paper on “Analysis Driven Optimization” [10] identifies four categories of what can limit a kernel’s performance: memory throughput, instruction throughput, latency, or a combination of the above. There are some methods that can be helpful in finding the limiting factors of a Cuda program. To determine whether memory throughput is a limiting factor, the CGMA of a kernel can help determine the theoretical maximum performance of the kernel. When it comes to instructions, the Nvidia profiler can give valuable information about undesirable code. Please refer to appendix E for a description of the Cuda profiler and CGMA.

Different optimisation techniques and methods exist, and some have already been described in chapters 3.1.4 and 3.1.5. In the following I describe the methods that form the optimisation strategy.

Algorithms that process data rely on memory to perform well. Coalescing memory access is important for all memory types, and in addition, access to shared memory should avoid bank conflicts as much as possible.

A loop structure in a kernel adds extra control flow instructions, which consume arithmetic resources. The organisation of threads in several dimensions can enable the unrolling of a loop by increasing thread granularity. The compiler already unrolls small loop structures, but doing it manually can help make a kernel run faster.
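Whether an access pattern causes bank conflicts can be checked with a small model. The C++ sketch below counts how many threads hit the same bank when thread t reads the 32-bit word at index t times a stride. It assumes that successive 32-bit words map to successive banks and that accesses to one bank are serialised (no broadcast). The 16 banks come from the CC 1.1 description in chapter 3.1.4, while the word-to-bank mapping and the function name are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worst-case number of threads that hit the same shared-memory bank when
// thread t accesses the 32-bit word at index t * strideWords.
// Assumption: successive 32-bit words map to successive banks, and
// accesses to the same bank are serialised (no broadcast).
int conflictDegree(int threads, int strideWords, int banks = 16) {
    assert(threads > 0 && strideWords > 0 && banks > 0);
    std::vector<int> hits(banks, 0);
    for (int t = 0; t < threads; ++t)
        ++hits[(t * strideWords) % banks];
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model, 16 threads with a stride of 1 are conflict free, a stride of 2 gives a two-way conflict, and a stride of 16 sends all 16 threads to the same bank. A stride of 17 is conflict free again, which is why padding a 16-word row with one extra word is a common remedy.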


Whether the block size should be high or low depends on the kernel, but it should, where possible, be a multiple of the warp size (currently 32) to avoid empty threads.

Hiding the latency of slow instructions can be achieved by reorganising the kernel, exploiting data prefetching, or making sure that the Cuda occupancy of the kernel is high. Notice that high occupancy is not equal to high performance [6][7]. Vasily Volkov has shown that a kernel can increase performance by having each thread output several results instead of a single one. It has also been shown that, using this method, high performance can be achieved with low occupancy.

Another technique for increasing performance focuses on using fast memory, such as registers or shared memory. Updating an implementation so that it divides the data into smaller pieces, called tiles, that fit into caches or shared memory can be very effective. The cost of copying data is amortised, and the kernel processes the cached data.

Some algorithms are not designed for parallel processing, and the performance they can deliver on a parallel architecture is not very high. Instead, for some algorithms, a block version has been designed. A block algorithm usually has three advantages over a normal algorithm. Firstly, it is able to solve much larger problems, by dividing the problem into smaller pieces and solving them independently. Secondly, dividing a problem into smaller pieces is the core of the tiling implementation strategy, so using a block algorithm can automatically enable tiling. Lastly, block algorithms sometimes rely on other linear algebra operations that are highly parallel, for instance matrix-multiplication.

So to clarify, tiling refers to a specific implementation that exploits a faster memory type, whereas block refers to the algorithm. For some algorithms block and tiling are almost the same (e.g. matrix-multiplication), for others they are not.
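The idea of tiling can be illustrated on the CPU with matrix-multiplication. The C++ sketch below processes square row-major matrices one tile at a time, so that each sub-block can stay in fast memory while it is reused. It is a minimal CPU illustration with a hypothetical function name, not one of the Cuda kernels of this project; in a kernel the tiles would be staged in shared memory instead of relying on the CPU cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for square n x n row-major matrices, processed tile by tile.
// Each tile x tile sub-block is small enough to stay in fast memory
// (the cache on a CPU, shared memory in a Cuda kernel).
std::vector<float> tiledMultiply(const std::vector<float>& A,
                                 const std::vector<float>& B,
                                 int n, int tile) {
    assert(n > 0 && tile > 0);
    assert(A.size() == static_cast<std::size_t>(n) * n && B.size() == A.size());
    std::vector<float> C(A.size(), 0.0f);
    for (int i0 = 0; i0 < n; i0 += tile)
        for (int j0 = 0; j0 < n; j0 += tile)
            for (int k0 = 0; k0 < n; k0 += tile)
                // Multiply one tile of A with one tile of B and
                // accumulate the partial sums into the tile of C.
                for (int i = i0; i < std::min(i0 + tile, n); ++i)
                    for (int j = j0; j < std::min(j0 + tile, n); ++j) {
                        float sum = C[i * n + j];
                        for (int k = k0; k < std::min(k0 + tile, n); ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }
    return C;
}
```

The `std::min` bounds let the sketch handle matrix sizes that are not a multiple of the tile size. The result is the same as that of the untiled triple loop; only the order in which the partial products are accumulated changes.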

6 Matrix-multiplication
A matrix is essentially a rectangular array of numbers, and is often denoted with a capital letter. Here the matrix A with two rows and three columns is shown.

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \end{pmatrix}$$

The numbers or values in a matrix are called elements, and are by convention denoted $a_{r,c}$, where r is the row index and c is the column index. The row index indicates in which row the element lies, whereas the column index indicates the column in which the element lies.

Matrix-multiplication, also called matrix product, is a linear algebra matrix operation consisting of the operations multiplication and addition. Elements in the respective matrices are aligned, multiplied, added, and then the grand sum is placed into the resulting matrix. Performing matrix-multiplication on two matrices is only possible if their dimensions conform for multiplication, meaning the number of columns of the first matrix must be equal to the number of rows in the second. The resulting matrix will be an $n \times p$ matrix, where $n$ is the number of rows of the first matrix and $p$ is the number of columns of the second.

Figure 5 - Matrix-multiplication process depicted

It should be further noted that this operation is not commutative, hence

$$A \cdot B \neq B \cdot A$$

except for special cases, where matrix-multiplication actually is commutative. These cases are however not described in any further detail, as they are outside the scope of this report.

The naive process of matrix-multiplication is rather simple. The data size of two square matrices is $2 \cdot n^2$ and the running time is $O(n^3)$, where n is the width and height of the matrices. This shows that the running time increases faster than the data size. The simple or naive implementation will be discussed and shown later, but there exist other algorithms which are more efficient, for instance the Strassen or Coppersmith–Winograd algorithms [11]. However, these algorithms add complexity to the implementation, and require extra attention to handling numerical stability issues.

The focus of this project's matrix-multiplication is on optimisations for the GPU platform, and not on the algorithm itself. For that reason, only the simple implementation will serve as a base for analysis, implementation and testing, and not the other algorithms mentioned. Optimisations applied to the simple algorithm will focus on capabilities and properties of the GPU platform, and the essence of the original matrix-multiplication algorithm will be kept. Furthermore, Cuda is the focus of this project, meaning the implementation and optimisations will be focussed on Cuda and the C/C++ implementations. GPU.NET will be used in the result section as a perspective and for comparison.

In the following, the matrix named A will always reference the first matrix of the matrix-multiplication process. The second matrix will be named B and the resulting matrix C, like so:

$$A_{n \times m} \cdot B_{m \times p} = C_{n \times p}$$

6.1 Analysis
Parallel processing on a GPU platform is stream based, and supports the parallelisation of data very well. The simple nature of the matrix-multiplication algorithm makes the implementation, for processing on a GPU platform, straightforward. The fact that the algorithm has a running time of $O(n^3)$ makes optimisations and performance gains easier to test and time on different platforms, and with different data sizes.

6.1.1 The sequential algorithm
As mentioned earlier, the focus will be on the simple matrix-multiplication algorithm. The sequential algorithm consists of the following steps:

    for (int i = 0; i < A.rowCount; ++i) {
        for (int j = 0; j < B.columnCount; ++j) {
            double sum = 0;
            for (int k = 0; k < A.columnCount; ++k) {
                double a = A[i][k];
                double b = B[k][j];
                sum += a * b;
            }
            C[i][j] = (float)sum;
        }
    }

This algorithm consists of three loops, and the inner loop has an addition and a multiplication operation. This also shows that the running time is $O(n \cdot m \cdot p)$, where n is the number of rows in A, m the number of columns of A and rows in B, and lastly p is the number of columns in B. For square matrices where $n = m = p$ the running time is $O(n^3)$.

The inner loop computes the dot-product of the vectors of A and B. The two outer loops are responsible for iterating through the rows of A and columns of B, and their mutual order does not influence the running time.

6.1.2 Parallelism
The simple matrix-multiplication algorithm consists of three loops. One adjustment to induce concurrency is to perform the outer loop in parallel; another is to calculate each value of the resulting matrix in parallel. In any case, there is not just one single solution to making matrix-multiplication work in parallel, but multiple. These different approaches will be discussed in the following.

Outer loop

A simple approach to make the outer loop work in parallel is to make threads handle each row in A. This is possible as there are no synchronization issues to handle, but it means that the total number of required threads will be equal to the number of rows in A. This adjustment is doable; however, any performance gains or losses are dependent on the data size. Consider the case where A is a column matrix and B a row matrix; this would mean many threads that each do little work.

Resulting matrix values

Calculating the different values of the resulting matrix is another way of making the algorithm work in parallel. By using this method the required number of threads will be equal to the size of the resulting matrix C, which is:

$$|C| = n \cdot p$$

where n is the number of rows in A and p is the number of columns in B. This shows that this approach is also prone to the problem where A is a column matrix and B a row matrix. Again, many threads do little work. For simplicity, I will not handle this case specifically, but will later present an algorithm that performs better with different data sizes.
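The thread counts of the two approaches can be compared with a small host-side sketch. The function names and the example sizes used in the usage note are hypothetical and not taken from the report; the sketch only restates the counting argument above and shows the usual round-up division for the number of blocks.

```cpp
#include <cassert>
#include <cstddef>

// Threads required by the outer-loop approach: one thread per row of A.
std::size_t threadsOuterLoop(std::size_t rowsA) {
    return rowsA;
}

// Threads required by the resulting-matrix approach: one thread per element of C.
std::size_t threadsPerElement(std::size_t rowsA, std::size_t colsB) {
    return rowsA * colsB;
}

// Number of blocks needed to cover `items` work items with `blockSize`
// threads per block, rounded up so that no item is left without a thread.
std::size_t blocksNeeded(std::size_t items, std::size_t blockSize) {
    assert(blockSize > 0);
    return (items + blockSize - 1) / blockSize;
}
```

For an assumed 400 x 400 problem, `threadsOuterLoop(400)` gives 400 threads while `threadsPerElement(400, 400)` gives 160,000, which illustrates how much finer the second decomposition is.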

6.2 Simple algorithm
Initially a simple and straightforward matrix-multiplication algorithm is implemented and tested. Based on the results and the properties and capabilities of the Cuda architecture, different optimisation techniques are implemented, tested and then evaluated.

6.2.1 The algorithm
The sequential algorithm described in the analysis is used to calculate the resulting matrix on the CPU. This reference matrix will be compared to the result matrix calculated on the GPU, and as such functions as the correctness test. As described in the analysis, the algorithm can be made parallel by processing either the outer loop or the calculation of the resulting matrix values in parallel. These methods are straightforward but have some drawbacks, meaning the performance is dependent on the data size. The solution is to find the best balance between threads and their workload, and the means is segmentation of data in blocks. The kernel for the first solution looks like this:
    __global__ void matrixMultiplicationSimple(matrix *a, matrix *b, matrix *c) {

        // Thread ID
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        double sum, av, bv;

        if (tid < a->height) {

            for (unsigned int j = 0; j < b->width; ++j) {

                sum = 0;

                for (unsigned int k = 0; k < a->width; ++k) {
                    av = a->n[tid * a->width + k];
                    bv = b->n[k * b->width + j];
                    sum += av * bv;
                }
                c->n[tid * b->width + j] = (float)sum;
            }
        }
    }

What is important to note is that the kernel has a loop in a loop, making the running time of a single thread $O(m \cdot p)$, where p is the number of columns in B and m the number of columns in A.

The kernel that calculates the values of the resulting matrix does not have this double loop.
    __global__ void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

        // Matrix C coordinates
        int c_column = blockIdx.x * blockDim.x + threadIdx.x;
        int c_row = blockIdx.y * blockDim.y + threadIdx.y;
        double sum, av, bv;

        // Make sure not to exceed C boundaries
        if (c_row < c->height && c_column < c->width) {

            sum = 0;

            for (int i = 0; i < a->width; i++) {

                av = a->n[c_row * a->width + i];
                bv = b->n[i * b->width + c_column];
                sum += av * bv;
            }

            c->n[c_row * b->width + c_column] = (float)sum;
        }
    }

The two kernels differ in the sense that a thread in the first does more work than a thread in the last. In the last kernel a loop structure was unrolled. Firstly, this makes the threads more fine-grained, which gives a higher parallel potential. Secondly, the control flow instructions from the loop are no longer performed, releasing more resources for the kernel.
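To make the index arithmetic of the resulting-matrix kernel easier to follow, the grid launch can be emulated sequentially on the CPU. This is only a sketch: the `Matrix` type, `resultElement` and `emulateLaunch` are hypothetical names introduced here, the `Matrix` layout (row-major, fields `width`, `height`, `n`) is assumed from how the kernels above access it, and a real launch runs the threads concurrently rather than in nested loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix, mirroring the fields used by the kernels.
struct Matrix {
    int width;
    int height;
    std::vector<float> n;
    Matrix(int w, int h)
        : width(w), height(h), n(static_cast<std::size_t>(w) * h, 0.0f) {}
};

// Work of one "thread": the dot-product for element (cRow, cColumn) of C.
void resultElement(const Matrix& a, const Matrix& b, Matrix& c,
                   int cRow, int cColumn) {
    // Threads that fall outside C do nothing (same guard as in the kernel).
    if (cRow >= c.height || cColumn >= c.width) return;
    double sum = 0.0;
    for (int i = 0; i < a.width; ++i) {
        sum += static_cast<double>(a.n[cRow * a.width + i]) *
               b.n[i * b.width + cColumn];
    }
    c.n[cRow * b.width + cColumn] = static_cast<float>(sum);
}

// Emulates a 2D grid of square blocks: every (block, thread) pair runs once.
void emulateLaunch(const Matrix& a, const Matrix& b, Matrix& c, int blockDim) {
    assert(blockDim > 0);
    assert(a.width == b.height && c.height == a.height && c.width == b.width);
    const int gridX = (c.width + blockDim - 1) / blockDim;   // round up
    const int gridY = (c.height + blockDim - 1) / blockDim;  // round up
    for (int by = 0; by < gridY; ++by)
        for (int bx = 0; bx < gridX; ++bx)
            for (int ty = 0; ty < blockDim; ++ty)
                for (int tx = 0; tx < blockDim; ++tx)
                    resultElement(a, b, c, by * blockDim + ty, bx * blockDim + tx);
}
```

Because the grid is rounded up to whole blocks, some emulated threads land outside C; the boundary guard is what makes that harmless.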

6.2.2 Test and results
Testing is performed on different platforms. To dedicate as much of the GPU's performance as possible to the algorithm, rather than to rendering of the results, the tests are executed using a console program.

Figure 6 - The output of the console testing program

Matrix-multiplication is tested with fixed dimensions for the matrices A and B, which also determine the dimensions of the resulting matrix C.

Outer loop

The Cuda occupancy calculator showed that the simple outer loop implementation would have a multiprocessor occupancy of 83%. The outer loop implementation was tested with the matrix structure as a parameter for the kernel. The kernel running time is, for Cuda, the direct GPU calculation time; hence it excludes the time to perform data transfer. The Cuda kernel running time does not say anything about whether it is feasible to perform matrix-multiplication on the GPU compared to the CPU, but only whether the GPU calculates the resulting matrix faster than the CPU. By just measuring the calculation time, it is easy to directly compare the computation time on the GPU with that of the CPU.

| Platform | Kernel running time | Operations/ms | Gigaflops | CPU running time |
|---|---|---|---|---|
| Cuda | 349.15 ms | 366,609 | 0.37 | 250.90 ms |

Table 4 - Test result of outer loop matrix-multiplication on platform #1
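The columns of the table are related by simple unit conversions: dividing the operations per millisecond by 10^6 gives gigaflops, so 366,609 operations/ms corresponds to roughly 0.37 gigaflops. The sketch below spells this out. The function names are hypothetical, and counting one multiplication plus one addition per inner-loop iteration is an assumption about how the operations were counted.

```cpp
#include <cassert>

// Floating-point operations of an (n x m) * (m x p) product, assuming one
// multiplication and one addition per inner-loop iteration.
double flopCount(double n, double m, double p) {
    return 2.0 * n * m * p;
}

// Throughput in operations per millisecond.
double opsPerMs(double operations, double runningTimeMs) {
    assert(runningTimeMs > 0.0);
    return operations / runningTimeMs;
}

// 1 gigaflops = 10^9 operations per second = 10^6 operations per millisecond.
double gigaflops(double operations, double runningTimeMs) {
    return opsPerMs(operations, runningTimeMs) / 1.0e6;
}
```

Keeping the operation count separate from the timing makes it possible to reuse the same helpers for every kernel variant, as long as the problem size is unchanged.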

Some interesting results have emerged from this initial test. First of all, there is a difference between the result calculated on the GPU and that calculated on the CPU. The maximum difference in the resulting values is 0.010742 on platform #1. The GPU architecture was initially designed for increased speed at the cost of precision, which partly explains the difference in the resulting values. Newer architectures implement an instruction set with increased precision; this kernel has been tested on an architecture with compute capability v2.0, where the difference between GPU and CPU results was 0.0. Another surprise from the test is that the GPU calculation only has a peak performance of 0.3666 gigaflops, and takes longer than on the CPU. What causes such bad performance, one might ask?

Outer loop without structure

Global reads are expensive, and coalesced memory reads should be achieved to optimise performance. Structures can, if not aligned, produce non-coalesced memory access. Whether using structures as parameters for the kernel had any impact on performance would be interesting to test. So, minor adjustments to the code were made to eliminate structures as parameters, and the updated kernel function definition now looked like this:
    __global__ void
    matrixMultiplicationSimpleNS(float *a, float *b, float *c, int aheight, int awidth, int bwidth)

The adjustment meant that the occupancy of the multiprocessor rose from 83% to 100%, an increase which suggested that better performance could be expected. But the kernel running time was measured at 357.88 ms, a running time that is approximately the same as when using structure parameters. Even though a higher occupancy suggested increased performance, no performance gain was achieved. This result is in line with Vasily Volkov's “Better performance at lower Occupancy” [6]. So the first optimisation will look into reorganising the threads, to see whether Cuda performs better when more threads perform less than when few threads perform more. But before doing so, testing and comparing performance with GPU.NET would indicate whether the GPU.NET API performs on the same level as using Cuda directly.

GPU.NET

The GPU.NET platform has different limitations; one is that kernel methods only support primitive types as parameters. For that reason, testing matrix-multiplication with the Matrix structure is not possible. So the outer loop matrix-multiplication method was implemented without structures, and can therefore be directly compared to the similar Cuda implementation. It is furthermore not possible, on the GPU.NET platform, to measure solely the direct calculation time; the kernel running time includes data transfer and JIT compilation.

| Platform | Kernel running time | Operations/ms | Gigaflops | CPU running time |
|---|---|---|---|---|
| Cuda | 357.88 ms | 357,661 | 0.36 | 381.23 ms |
| GPU.NET | 409.00 ms* | 312,958 | 0.31 | 387.00 ms |

Table 5 - Test result of outer loop matrix-multiplication without structure on platform #1

* Inclusive data transfer and JIT compilation

Taking data transfer and JIT compilation into account, the performance of GPU.NET and Cuda are almost identical. It will be interesting to see whether this is also the case when different optimisation techniques and features are exploited.

6.3 Optimisation
The Cuda architecture has different characteristics and capabilities, and to optimise performance different features and techniques can be utilised. First the unrolling of a loop will be tried, after which the tiling and other methods from the strategy will be applied.

6.3.1 Unroll loop with threads
The simple implementation was designed so that fewer threads performed more work. The first optimisation will try to uncover whether more threads performing less work, by unrolling a loop, actually is better. This can be achieved by modifying the algorithm so that each thread calculates one value in the resulting matrix.

The modified kernel is shown in the following:
    __global__
    void matrixMultiplicationRM(matrix *a, matrix *b, matrix *c) {

        // Matrix C coordinates
        int c_column = blockIdx.x * blockDim.x + threadIdx.x;
        int c_row = blockIdx.y * blockDim.y + threadIdx.y;
        double sum, av, bv;

        // Make sure not to exceed C boundaries
        if (c_row < c->height && c_column < c->width) {

            sum = 0;

            for (int i = 0; i < a->width; i++) {

                av = a->n[c_row * a->width + i];
                bv = b->n[i * b->width + c_column];
                sum += av * bv;
            }

            c->n[c_row * b->width + c_column] = (float)sum;
        }
    }

Test and result

A similar modification was made to the kernel in GPU.NET and the running times are shown in the following table.

| Platform | Kernel running time | Operations/ms | Gigaflops | CPU running time |
|---|---|---|---|---|
| Cuda (outer loop) | 349.15 ms | 366,609 | 0.37 | 250.90 ms |
| Cuda | 53.04 ms | 2,413,297 | 2.41 | 253.14 ms |
| GPU.NET | 143.00 ms* | 895,104 | 0.90 | 372.00 ms |

Table 6 - Test result of matrix-multiplication for resulting matrix on platform #1

* Inclusive data transfer and JIT compilation

The running time of both the GPU.NET and the Cuda implementation has decreased. GPU.NET performs about 2.86 times better than the GPU.NET outer loop implementation; however, this performance increase is disappointing when compared to the performance gain achieved when purely using Cuda. When looking solely at the Cuda implementation, the performance is almost 6.6 times better than with the outer loop approach. So even though the GPU.NET performance, compared to Cuda, is disappointing, the performance gains are significant, and indeed indicate that more threads doing less work by unrolling a loop is a reasonable approach.

When programming for parallel execution on the CPU platform, it is important to use the correct number of threads to solve the problem optimally, and as spawning a thread is expensive, not too many threads should be used. The results from these tests show that this rule of thumb does not apply to the GPU platform. The overhead of creating a thread in Cuda is far less than that of creating a thread on the CPU.

Another factor is the number of global memory reads. Reading from global memory is expensive and very slow [5]. Looking at the code of the kernel shows that the inner loop makes two global memory reads and performs one multiplication and one addition operation. This equals a CGMA ratio of approximately 1.0. On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. With 4 bytes in each single-precision floating-point value, at most 4.15 giga single-precision values (16.6/4) can be loaded per second. With a CGMA ratio of 1.0, this kernel will execute at no more than 4.15 gigaflops [4]. So in short, this kernel is memory-bound, and to optimise a memory-bound kernel the focus should be on global memory access. One method for doing this is tiling, described in the following section.
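The memory bound derived above can be reproduced with a few lines of host code. This is a sketch with hypothetical function names; it only encodes the arithmetic from the text (bandwidth divided by bytes per value, multiplied by the CGMA ratio) and caps the result at an arithmetic peak supplied by the caller.

```cpp
#include <algorithm>
#include <cassert>

// Giga values per second that global memory can deliver.
double gigaValuesPerSecond(double bandwidthGBs, double bytesPerValue) {
    assert(bytesPerValue > 0.0);
    return bandwidthGBs / bytesPerValue;
}

// Upper bound in gigaflops imposed by memory throughput for a given CGMA ratio.
double memoryBoundGigaflops(double bandwidthGBs, double bytesPerValue, double cgma) {
    return gigaValuesPerSecond(bandwidthGBs, bytesPerValue) * cgma;
}

// The attainable peak is the lower of the memory bound and the arithmetic peak.
double attainableGigaflops(double memoryBound, double arithmeticPeak) {
    return std::min(memoryBound, arithmeticPeak);
}
```

With the figures for platform #1, `memoryBoundGigaflops(16.6, 4.0, 1.0)` gives about 4.15 gigaflops. The same helper applies to any other CGMA ratio, which is why raising the ratio is the natural next optimisation target.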

6.3.2 Tiling v1
One of the fastest memory types on a Cuda device is the shared memory. The shared memory is on-chip and very fast, but also limited. Shared memory is accessible to and shared by all threads in a block, so it is an obvious choice for a block cache. One strategy for reducing global memory traffic is to partition data into tiles that will fit into the shared memory: load a tile of data from device memory into shared memory, process the data, and lastly write the results back to device memory [2]. One important criterion is that the computations on these tiles must be able to be processed independently. This requires the threads in a block to be synchronised, as shown in the following kernel code:

    __global__
    void matrixMultiplicationTILINGns(float* a, float* b, float* c, int aWidth, int bWidth) {

        // blockDim.x = TILING_DIM (last is defined and hence faster)
        // blockDim.y = TILING_DIM (last is defined and hence faster)

        int bx = blockIdx.x;
        int by = blockIdx.y;
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Matrix C coordinates
        int c_column = bx * TILING_DIM + tx;
        int c_row = by * TILING_DIM + ty;

        // Calculate the first index of the row in a, and the last, for the
        // current thread
        int aIdxBegin = c_row * aWidth + tx;
        int aIdxEnd = aIdxBegin + aWidth - 1;
        int bIdxBegin = c_column + bWidth * ty;

        float sum = 0.0;

        for (int aIdx = aIdxBegin,
             bIdx = bIdxBegin; aIdx <= aIdxEnd;) {

            __shared__ float ac[TILING_DIM][TILING_DIM]; // A cache
            __shared__ float bc[TILING_DIM][TILING_DIM]; // B cache

            // Load values to cache
            ac[tx][ty] = a[aIdx];
            bc[tx][ty] = b[bIdx];

            // Synchronise to make sure all threads in the block have saved
            // values to the shared memory for this phase
            __syncthreads();

            for (int i = 0; i < TILING_DIM; ++i) {
                sum += ac[i][ty] * bc[tx][i];
            }

            // Synchronise to make sure that the computations are done
            __syncthreads();

            aIdx += TILING_DIM;          // Advance index by phase dimension
            bIdx += TILING_DIM * bWidth; // Advance index by phase dimension and
                                         // b width
        }

        // Insert dot-product in resulting matrix
        c[c_row * bWidth + c_column] = sum;
    }
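The two phases of the tiling strategy, loading a tile and then computing on it, can also be emulated on the CPU. The sketch below is an illustration under assumptions, not the report's implementation: `tiledMultiply` and `TILE` are hypothetical names, small local arrays stand in for shared memory, and the matrix dimensions are assumed to be multiples of the tile width, as the kernel above also appears to assume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 4;  // hypothetical tile width (TILING_DIM in the kernel)

// Tiled product of row-major A (n x m) and B (m x p).
std::vector<float> tiledMultiply(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int n, int m, int p) {
    assert(n % TILE == 0 && m % TILE == 0 && p % TILE == 0);
    std::vector<float> c(static_cast<std::size_t>(n) * p, 0.0f);
    float ac[TILE][TILE];  // stands in for the shared-memory tile of A
    float bc[TILE][TILE];  // stands in for the shared-memory tile of B

    for (int by = 0; by < n / TILE; ++by) {
        for (int bx = 0; bx < p / TILE; ++bx) {  // one "block" per tile of C
            float sum[TILE][TILE] = {};
            for (int phase = 0; phase < m / TILE; ++phase) {
                // Load phase: each "thread" (ty, tx) copies one value of A and B.
                for (int ty = 0; ty < TILE; ++ty) {
                    for (int tx = 0; tx < TILE; ++tx) {
                        ac[ty][tx] = a[(by * TILE + ty) * m + phase * TILE + tx];
                        bc[ty][tx] = b[(phase * TILE + ty) * p + bx * TILE + tx];
                    }
                }
                // Compute phase: starts only after the whole tile is loaded,
                // which is what __syncthreads() guarantees on the GPU.
                for (int ty = 0; ty < TILE; ++ty)
                    for (int tx = 0; tx < TILE; ++tx)
                        for (int i = 0; i < TILE; ++i)
                            sum[ty][tx] += ac[ty][i] * bc[i][tx];
            }
            // Write the finished tile of C back to "global" memory.
            for (int ty = 0; ty < TILE; ++ty)
                for (int tx = 0; tx < TILE; ++tx)
                    c[(by * TILE + ty) * p + bx * TILE + tx] = sum[ty][tx];
        }
    }
    return c;
}
```

Each value copied into `ac` or `bc` is read `TILE` times in the compute phase, which is the reuse that lowers the global memory traffic.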

Looking at the Cuda kernel the CGMA is calculated by:

$$\text{CGMA} = \frac{\text{block dimension size} \cdot (1\ \text{multiplication} + 1\ \text{addition})}{2\ \text{global memory reads}}$$

Here the block dimension size is the axis size of one dimension in the block, and global memory reads is the number of accesses to global memory.

The block dimension was set to 20, giving a CGMA of 20, and with 4.15 giga single-precision values per second for platform #1, the immediate kernel peak performance is calculated to 83 gigaflops. This is an impressive theoretical peak performance for this kernel, taking into account that the global memory has a bandwidth of 16.6 GB/sec. However, the GPU of platform #1 has a peak performance of 34.38 gigaflops and the kernel is limited by that, so the theoretical maximum performance of this kernel on this platform is 34.38 gigaflops. That the kernel on platform #1 is limited to 34.38 gigaflops indicates that the tile-strategy kernel is no longer memory-bound, but actually arithmetic-bound. This is theoretically true, but the picture might be different when the test has been performed and the result is ready.

Test and result

To test whether using the matrix structure as parameter had any impact on performance in Cuda, I implemented two tiled kernels. The first one used the matrix structure as parameter and the second used pointer arrays. GPU.NET supports shared memory as well; however, only arrays with one dimension were supported. So the shared memory indexes in the source code were adjusted to align the arrays sequentially. Besides this minor adjustment the Cuda kernel was easy to port to GPU.NET, and the test results are shown in the following table.

| Platform | Kernel running time | Operations/ms | Gigaflops | CPU running time |
|---|---|---|---|---|
| Cuda | 39.52 ms | 3,238,368 | 3.24 | 257.92 ms |
| Cuda (No struct) | 35.96 ms | 3,559,595 | 3.56 | 254.48 ms |
| GPU.NET | 76.00 ms* | 1,684,210 | 1.68 | 357.00 ms |

Table 7 - Test result of matrix-multiplication for tiling strategy on platform #1

* Inclusive data transfer and JIT compilation

The block has two dimensions with a length of 20, which gives 20 ∗ 20 = 400 threads. The normal recommendation is to make the block size divisible by the warp size, currently 32. However, these tests were also performed with a block size of 16 ∗ 16 = 256 threads, but the peak performance for the Cuda kernel was then only about 1.6 gigaflops. The conclusion is that it sometimes pays off not to follow the recommendation. In this case, the overhead of filling the warp with empty padded threads is insignificant compared to the larger amount of coalesced memory reads that the larger block size results in.

By using the tile strategy and shared memory, it was possible to perform matrix-multiplication even faster than with the resulting matrix algorithm. Cuda was about 1.47 times faster and GPU.NET was faster by a factor of 1.88. Looking at the peak performance, the result indicates that even though the algorithms are the same, GPU.NET has no chance of performing on the same level as when Cuda is used directly. This is most likely due to the fact that GPU.NET JIT compiles the device code.

The Cuda kernel has a peak performance of 3.56 gigaflops, which is remarkably slow compared to the theoretical 34.38 gigaflops. This gives an actual performance that is just 10.35% of what is theoretically possible. And even though the kernel algorithm is arithmetic-bound in theory, this significantly slower performance suggests that the limiting factors can in fact be both arithmetic and memory.
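The cost of a block size that is not a multiple of the warp size can be quantified with a small sketch. The helper names are hypothetical; the only inputs taken from the text are the warp size of 32 and the two block sizes discussed.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // warp size as stated in the text

// Number of warps needed to hold all threads of one block.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Thread slots in the last warp that stay idle because of padding.
int idleThreadSlots(int threadsPerBlock) {
    return warpsPerBlock(threadsPerBlock) * kWarpSize - threadsPerBlock;
}
```

For a block of 20 ∗ 20 = 400 threads this gives 13 warps with 16 idle slots, that is 16 out of 416 slots, whereas 16 ∗ 16 = 256 threads fill 8 warps exactly.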

6.3.3 Tiling v2 with latency hiding
When a kernel does not reach the expected performance level, a good place to start is to analyse the kernel with focus on coalesced memory access. Note that even though shared memory is fast, access should still be optimised with regards to coalescing access. The following kernel is the result of such an analysis:
    1.  __global__ void matrixMultiplicationTILINGns_v2(float* a, float* b, float* c, int aWidth, int bWidth) {
    2.
    3.      // Declare cache
    4.      __shared__ float ac[TILING_DIM][TILING_DIM];
    5.      __shared__ float bc[TILING_DIM][TILING_DIM];
    6.
    7.      // Calculate Matrix C coordinates
    8.      const int c_column = blockIdx.x * TILING_DIM + threadIdx.x;
    9.      const int c_row = blockIdx.y * TILING_DIM + threadIdx.y;
    10.     const int cidx = c_row * bWidth + c_column;
    11.
    12.     // Calculate the first index of the row in a, and the last, for the
    13.     // current thread
    14.     const int aIdxBegin = c_row * aWidth + threadIdx.x;
    15.     const int aIdxEnd = aIdxBegin + aWidth - 1;
    16.
    17.     float sum = 0.0;
    18.
    19.     for (int aIdx = aIdxBegin, bIdx = c_column + bWidth * threadIdx.y; aIdx <= aIdxEnd;)
    20.     {
    21.         ac[threadIdx.y][threadIdx.x] = a[aIdx];
    22.         aIdx += TILING_DIM; // Increase a index
    23.
    24.
    25.         bc[threadIdx.y][threadIdx.x] = b[bIdx];
    26.         bIdx += TILING_DIM*bWidth; // Increase b index
    27.
    28.         // Synchronise to make sure all threads in the block have saved
    29.         // values to the shared memory for this phase
    30.         __syncthreads();
    31.
    32.         // Compute dot-product
    33.         for (int i=0; i < TILING_DIM; ++i) {
    34.             sum += ac[threadIdx.y][i]*bc[i][threadIdx.x];
    35.         }
    36.
    37.         // Synchronise to make sure that the computations are done
    38.         __syncthreads();
    39.     }
    40.     // Insert dot-product in resulting matrix
    41.     c[cidx] = sum;
    42. }

To optimise this v2 kernel, four register variables have been removed to increase the number of possible active warps. Furthermore, the shared memory access in lines 21 and 25 has been optimised for coalescing. This now yields a peak performance of 5.028 gigaflops on platform #1. Nvidia provides a code example implementation of matrix-multiplication using the tile strategy. That kernel has a peak performance of 4.91 gigaflops, so with these minor updates it is possible to get a kernel that performs better.

6.3.4 Tiling v3 with prefetching
Access to global memory is limited by bandwidth and high latency. The high latency of access to global memory makes kernel execution halt until the data is served. Organising the code such that other instructions can be executed while a thread is waiting for data is called latency hiding. One way of hiding latency from memory access is to exploit data prefetching. This basically works by fetching the next data while the current data is being processed. The following pseudo code shows the steps of using prefetching for matrix-multiplication:

    __global__ void mm_prefecth(float* a, float* b, float* c, int aWidth, int bWidth) {

        // Load data from global memory to register variables

        while (data to process) {

            // Insert register values to shared memory

            // Synchronise threads

            // Prefetch next values to register

            // Calculate the dot-product

            // Synchronise to make sure that computations are done
        }

        // Insert dot-product in resulting matrix
    }

By exploiting prefetching the peak performance of matrix-multiplication increased for system #1 and #4. 5.68 gigaflops was reached for system #1, while system #3 surprisingly incurred a performance loss of 2.2 gigaflops.

6.3.5 Tiling v4 and v5 with more output per thread
Vasily Volkov from UC Berkeley has looked deeper into “Better performance at Lower Occupancy” [6]. He has shown that it is possible to increase the performance of matrix-multiplication by making a thread compute several dot-products instead of one. Volkov points out two things in particular. By making a thread compute more output, a thread can reuse values in registers for several computations. Registers are faster than both shared and global memory, and by grouping the work of several threads together it is possible to use registers for data sharing. The hypothesis is that for memory-heavy kernels it is beneficial to let fewer threads carry a higher workload, to exploit the fast registers for data sharing.

In the matrix-multiplication kernel this has another advantage: as the dot-product is calculated between a single column in B and several rows in A, the memory read access to this specific column in B is reduced by 3/4 when four rows are processed. Consider the following inner loop, which computes the dot-product for the different rows. The read of bc[i][threadIdx.x] is the same in all four statements and is hence automatically cached.

    // Calculate the dot-product
    for (int i=0; i < TILING_DIM; ++i) {
        sum[0] += ac[threadIdx.y][i]    * bc[i][threadIdx.x];
        sum[1] += ac[threadIdx.y+5][i]  * bc[i][threadIdx.x];
        sum[2] += ac[threadIdx.y+10][i] * bc[i][threadIdx.x];
        sum[3] += ac[threadIdx.y+15][i] * bc[i][threadIdx.x];
    }
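The reuse of the B value can be made explicit in a CPU sketch of the work done by one thread in one tile phase. The names `accumulateFourOutputs`, `OUTPUTS` and `ROW_STEP` are hypothetical; the tile width of 20 and the row offsets 5, 10 and 15 are taken from the listing above, and the tiles are stored as flat row-major arrays purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILING_DIM = 20;                  // tile width used in the text
constexpr int OUTPUTS = 4;                      // dot-products per thread
constexpr int ROW_STEP = TILING_DIM / OUTPUTS;  // 5, the row offset in the listing

// Work of one "thread" (tx, ty), with ty in [0, ROW_STEP), for one tile phase.
// ac and bc are the cached tiles of A and B, stored row-major.
void accumulateFourOutputs(const std::vector<float>& ac,
                           const std::vector<float>& bc,
                           int tx, int ty, float sum[OUTPUTS]) {
    assert(ac.size() == static_cast<std::size_t>(TILING_DIM * TILING_DIM));
    assert(bc.size() == static_cast<std::size_t>(TILING_DIM * TILING_DIM));
    assert(tx >= 0 && tx < TILING_DIM && ty >= 0 && ty < ROW_STEP);
    for (int i = 0; i < TILING_DIM; ++i) {
        // Read the B value once and reuse it for all four rows of A.
        const float bValue = bc[i * TILING_DIM + tx];
        for (int o = 0; o < OUTPUTS; ++o) {
            sum[o] += ac[(ty + o * ROW_STEP) * TILING_DIM + i] * bValue;
        }
    }
}
```

Holding `bValue` in a local variable corresponds to keeping it in a register: four multiply-adds are performed per read of the B tile instead of one.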

I made two tests, one where a single thread computes two dot-products, and another where a thread computes four dot-products. The results of platform #1 were a bit surprising and have therefore also been tested on platform #4.

| Kernel/System | #1 (CC v1.1) | #3 (CC v1.3) | #4 (CC v2.0) |
|---|---|---|---|
| Tiling v3 Prefetching | 5.68 gigaflops (Occupancy: 54%) | 113.52 gigaflops | 34.57 gigaflops (Occupancy: 81%) |
| Tiling v4 2 outputs/thread | 4.45 gigaflops (Occupancy: 58%) | 96.51 gigaflops | 36.89 gigaflops (Occupancy: 88%) |
| Tiling v5 4 outputs/thread | 5.89 gigaflops (Occupancy: 67%) | 113.83 gigaflops | 37.13 gigaflops (Occupancy: 67%) |

Table 8 - Tiling with 2 and 4 outputs per thread comparison for different platforms

The performance expectations set by the Volkov slides say that several outputs/thread could perform better than 1 output/thread. The results from platform #4 are the only ones following this pattern, ending with a peak performance of 37.13 gigaflops. This performance is achieved with an occupancy level of 67%, confirming that higher performance can be achieved with a lower occupancy rate. But it is evident that using Volkov's suggestion is not a certain measure for success. It is interesting to see that the performance of the 2 outputs/thread method actually was lower than expected for system #1 and #3, and that the 4 outputs/thread method on these systems yielded results similar to those obtained when not using Volkov's suggestions at all.

The major difference between system #1 and #3 on one side and #4 on the other is the compute capability level of the GPU devices. System #1 and #3 belong to the 1.x generation, whereas #4 belongs to the 2.x generation. This might be the reason that the algorithm performs relatively differently on different architectures. In the next section I will look into whether the CC level has any influence on the performance of an algorithm.

6.3.6 Cuda compute capability
The features, capabilities and instruction set of a GPU are specified by its compute capability (CC). The first Cuda-capable Nvidia graphics cards were released with a compute capability level from v1.0 to v1.1; the newest cards today are released with a CC level of 2.1. A higher level defines a superset of the features of those at a lower level [4], and a higher level also indicates a newer generation GPU. The newer CC levels both add new features and improve on existing ones. One of the important factors in performance is the impact of non-coalesced memory access. This has been improved in many ways for CC 2.0 and 2.1, which means that the developer does not need to use as much energy on designing an algorithm with memory coalescing in mind, making it easier to port existing algorithms. The following figure shows the peak performance in gigaflops of different kernels targeting different CC levels. The results can also be found in appendix F.

[Figure 7: bar chart of peak performance in gigaflops (vertical axis from 0 to 40) per kernel, with one series each for CC 1.1, CC 1.3 and CC 2.0]

Figure 7 - Performance of kernels executed for different CC levels on platform #4

It was expected that the kernels would perform best on higher CC levels, and the rule of thumb is to target the highest CC level possible when compiling kernels, to take advantage of the newest optimisations and features. This holds except for the last kernel, where CC levels 1.1 and 1.3 have a peak performance of 39.47 gigaflops, which is faster than the 37.06 gigaflops that CC 2.0 delivers. Kernels executed on CC level 2.0 are in general between 5% and 10% faster than on lower levels, except for the last case described above, which is 6.11% slower. I have not been able to find any explanation for why the rule of thumb is not valid for this specific case, but even though this exception breaks the rule, I still recommend compiling for the highest CC level possible.

6.4 Evaluation
The key to a well-performing kernel is memory coalescing and latency hiding. The first step should be to structure the algorithm so that as much memory coalescing as possible is achieved. Tiling has proven to be a very good strategy for increasing memory coalescing. A memory access limitation is due to the DRAM memory design, so memory coalescing optimisation techniques should also be applied to shared memory.

Using matrix structures as parameters was initially thought of as a good abstraction, but tests showed that performance losses were the result. It is therefore recommended to use primitive variables or pointers in the kernel function definitions.

Data prefetching combined with reordering of operations in the kernel was used to hide latency, and gave diverging results. On three systems a performance gain of about 1 gigaflop was achieved, but on the Tesla C1060 device a loss of 2.2 gigaflops was the result. Maybe the hardware of the Tesla card is already optimised with regard to hiding this type of latency, so that trying to handle this in the kernel counteracts these hardware optimisations. In any case, this shows that data prefetching should be applied to a kernel with care.

Volkov presented the idea that latency hiding and the exploitation of registers can achieve a higher peak performance. This proved to be true, and by making a thread do more work it was possible to make a kernel perform even better. This optimisation technique furthermore showed that the occupancy rate should not necessarily be relied on for an optimisation strategy.

Control flow is another factor to keep in mind when designing a kernel. The Cuda architecture is SIMT, and if a kernel has a complex control flow, several runs by the warp scheduler can be necessary to complete the warp. This can result in an unnecessarily long computing time.


The same operations processed on a GPU and a CPU do not always yield the same numerical result. This is especially true for older architectures with compute capability levels between 1.0 and 1.3. Newer architectures support the IEEE 754 standard better and in many cases yield a result with better precision. In the tests of matrix-multiplication the maximum difference in values was 0.013. To minimise this inaccuracy, special but slower intrinsic functions can be used in the kernel. These functions deviate less from IEEE 754 and force the compiler not to use FMAD instructions, which are fast but imprecise multiply-add instructions.


7 LU decomposition
LU-decomposition, also called LU-factorisation, is a linear algebra matrix decomposition of a matrix A in the form:

$$A = LU$$

where L and U are lower and upper triangular matrices [17]. If the LU factorisation is known, it can be used to solve the matrix-vector linear equation $Ax = b$ in two steps:

1: $Ly = b$
2: $Ux = y$

Decomposing matrix A to a product of L and U can be achieved by using an enhanced version of Gauss elimination. Only square matrices are considered here, and the L and U matrices are of the same size as A, as shown here for a 3 × 3 matrix:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ l_{21} & 1 & 0 \\ l_{31} & l_{32} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}$$

Note that L is a unit lower triangular matrix, meaning the diagonal elements of L are all one. Stewart provides a sequential algorithm that builds on Gauss elimination, which also creates a unit lower triangular matrix.
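The two-step solve can be sketched in C as follows. This is only a minimal sketch under stated assumptions: both factors are stored in place in one row-major array (L below the diagonal with its unit diagonal implied, U on and above the diagonal), no pivoting has been applied, and the function name lu_solve is hypothetical rather than part of the thesis code.

```c
/* Solve A x = b given the in-place LU factors of A (row-major, n x n).
 * Assumptions: L has an implied unit diagonal and is stored below the
 * diagonal of lu; U is stored on and above the diagonal; no pivoting.
 * lu_solve is a hypothetical helper name used for illustration. */
void lu_solve(const float *lu, int n, const float *b, float *x)
{
    /* Step 1: forward substitution, L y = b (y is stored in x). */
    for (int i = 0; i < n; i++) {
        float sum = b[i];
        for (int j = 0; j < i; j++)
            sum -= lu[i * n + j] * x[j];
        x[i] = sum;                 /* diagonal of L is 1: no division */
    }

    /* Step 2: back substitution, U x = y. */
    for (int i = n - 1; i >= 0; i--) {
        float sum = x[i];
        for (int j = i + 1; j < n; j++)
            sum -= lu[i * n + j] * x[j];
        x[i] = sum / lu[i * n + i];
    }
}
```

Each step costs on the order of n² operations, so once the factors are known, additional right-hand sides are cheap compared with the decomposition itself.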

7.1 Analysis
Stewart designs an LU-decomposition algorithm and provides the code that overwrites the matrix A with its LU factorisation [17]. Pivoting, or row interchanges, may be required for two reasons: firstly to ensure the existence of an LU factorisation, and secondly to increase the numerical stability of the Gaussian elimination algorithm [18]. For simplicity, algorithms without pivoting will initially be analysed, but when testing, only algorithms that implement partial pivoting will be used. This makes sure that the performance and correctness of the individual algorithms can be compared.
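To illustrate what partial pivoting adds to the elimination, the following C sketch selects, for every column, the row with the largest absolute value and swaps it into the pivot position before eliminating. It is a sketch under stated assumptions, not the implementation used in the tests: the matrix is row-major and square, and the function name lu_decompose_pivot and the perm array are hypothetical.

```c
#include <math.h>

/* In-place LU decomposition with partial pivoting (row-major, n x n).
 * perm[k] records which row was swapped into position k at step k.
 * Returns 0 on success and -1 if a column has no non-zero pivot.
 * lu_decompose_pivot is a hypothetical name used for illustration. */
int lu_decompose_pivot(float *m, int n, int *perm)
{
    for (int k = 0; k < n; k++) {
        /* Find the row with the largest absolute value in column k. */
        int p = k;
        float best = fabsf(m[k * n + k]);
        for (int i = k + 1; i < n; i++) {
            float v = fabsf(m[i * n + k]);
            if (v > best) { best = v; p = i; }
        }
        perm[k] = p;
        if (best == 0.0f)
            return -1;              /* no usable pivot: matrix is singular */

        /* Swap rows k and p so the pivot sits on the diagonal. */
        if (p != k) {
            for (int c = 0; c < n; c++) {
                float t = m[k * n + c];
                m[k * n + c] = m[p * n + c];
                m[p * n + c] = t;
            }
        }

        /* Eliminate below the pivot. */
        for (int i = k + 1; i < n; i++) {
            float rik = (m[i * n + k] /= m[k * n + k]);
            for (int c = k + 1; c < n; c++)
                m[i * n + c] -= rik * m[k * n + c];
        }
    }
    return 0;
}
```

The pivot search is a linear scan over one column, which is the part a GPU implementation can replace with a parallel reduction.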


7.1.1 The sequential algorithm
The sequential algorithm, without pivoting, consists of three loops that overwrite the existing matrix m. The matrix is vectorised and the index of an element is found by $r \ast w + c$, where r is the row index, w the width of the matrix and c the column index.

    // Core algorithm for LU Decomposition
    for (int k = 0; k < n; k++)
    {
        for (int i = k + 1; i < n; i++)
        {
            // Compute scale factor Rik
            float Rik = (m[i * mWidth + k] /= m[k * mWidth + k]);

            // Subtract row k elements from row i elements with the
            // Rik scale factor
            for (int c = k + 1; c < n; c++)
            {
                m[i * mWidth + c] -= Rik * m[k * mWidth + c];
            }
        }
    }

The code shown above is without pivoting for simplicity reasons. The sequential implementation is simple, consisting of three loops using the operations divide, multiplication and addition. The data size of the square matrix is $n^2$, and the algorithm requires $n^3/3$ additions and multiplications, and $n^2/2$ divisions to complete. Ignoring the lower order term, the running time is $O(n^3)$, where n is both the width and height of the matrix [19]. This shows that the running time grows faster than the data size.
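The operation counts can be checked by counting loop iterations instead of executing them. The C sketch below mirrors the loop bounds of the three-loop algorithm above; the helper name lu_count_ops is hypothetical, and one multiplication together with its subtraction is counted as a single pair.

```c
/* Count the operations executed by the three-loop algorithm for an
 * n x n matrix without doing any floating-point work.
 * divisions: one per (k, i) pair.
 * mul_sub:   one multiply-subtract pair per (k, i, c) triple.
 * lu_count_ops is a hypothetical helper used for illustration. */
void lu_count_ops(long n, long *divisions, long *mul_sub)
{
    *divisions = 0;
    *mul_sub = 0;
    for (long k = 0; k < n; k++) {
        for (long i = k + 1; i < n; i++) {
            (*divisions)++;
            *mul_sub += n - (k + 1);   /* iterations of the inner c loop */
        }
    }
}
```

For n = 1000 this gives 499,500 divisions and 332,833,500 multiply-subtract pairs, close to the leading terms n²/2 = 500,000 and n³/3 ≈ 333,333,333.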

7.1.2 Parallelism
Matrix-multiplication and the characteristics of its data access meant that inducing concurrency and exploiting data-parallelism was straightforward. The same cannot be said about LU-decomposition, in which data dependencies between the loops make parallelising more complicated. The sequential algorithm consists of three loops, and the operations performed result in an asymptotic running time as shown:

$$\frac{n^2}{2} + \frac{n^3}{3} = O(n^3)$$

where n is the width or height of the matrix. The outer loop cannot be directly performed in parallel, as the $k$th iteration depends on the results from the preceding iterations.

Parallelism is not impossible, but the order of the outer loop is vital. Taking the outer loop into account, the operations part of the algorithm can be written as:

$$n \left( \frac{n}{2} + \frac{n^2}{3} \right)$$

This equation says that for each step $k$, $\frac{n}{2} + \frac{n^2}{3}$ operations are performed, and these operations can be performed in parallel. The required number of operations when taking parallelisation into account is:

$$\frac{n}{p} \ast \left( \frac{n}{2} + \frac{n^2}{3} \right)$$

where $p$ is the number of processes. The optimal execution performs all tasks possible in parallel; for this algorithm the optimal number of processes is $p = n - 1$. But for simplicity, let's set it to $p = n$:

$$\frac{n}{2} + \frac{n^2}{3} = O(n^2)$$

So this algorithm does have a parallel potential. I will now look into whether this potential can be utilised by the Cuda architecture.

Multipliers and row operations

The parts of the algorithm that exhibit little or no parallelism should be processed on the CPU. This means the outer loop is processed by the CPU and the inner parts that exhibit parallelism will be processed by the GPU. So the LU-decomposition implementation should be processed by the CPU and GPU in cooperation. The interesting point will be to see if it is possible, when also considering data transfer, to make the GPU assist the CPU in accelerating the execution of the LU-decomposition algorithm.

To make the initial implementation simple, I have divided the calculation of multipliers and the row operations into separate tasks. For each step of the outer loop, the multipliers are calculated for the current column (line 2 to 4), after which the multipliers are used to calculate the elements in the upper triangular matrix (line 5 to 9). These are individual tasks that can be performed in parallel as shown in the following pseudo code.


     1. For k from 1 To n-1
     2.     For i in { k+1, ...,n }
     3.         LU[i][k] = LU[i][k] / LU[k][k]
     4.     End
     5.     For j in { k+1, ...,n }
     6.         For i from k+1 To n
     7.             LU[i][j] = LU[i][j] – LU[i][k] * LU[k][j]
     8.         End
     9.     End
    10. End

This algorithm is not the only method for creating the LU-factorisation. There are other algorithms that structure the operations in the outer loop differently, and the main difference is their memory access patterns. The performance of different memory access patterns may vary depending on the memory types used in the algorithm; another factor is whether the tasks are fine-grained or coarse-grained. For now I recognise the existence of other algorithms, but use the one described for the simple implementation.

7.2 Simple algorithm
A simple version of the LU-decomposition algorithm was implemented and tested. The optimisation steps uncovered from the matrix-multiplication implementations on Cuda will be used to optimise the LU-decomposition implementation. Performance and correctness tests will be performed and compared with different CPU algorithms, both with and without pivoting.

7.2.1 The algorithm
The sequential algorithm described in the analysis is used to calculate the LU matrices on the CPU, and serves as a comparison for the GPU computed result. This algorithm does not allow the same level of parallelism as matrix-multiplication, but parts of the algorithm can be parallelised. The following sample shows the code that performs the outer loop on the host, and makes calls to kernels processed by the device. Lines 7-9 show an optional call to a device pivoting kernel with a running time of $O(n)$. This composition ensures that the order of the outer loop is maintained, and the remaining tasks are performed in parallel.
     1. for (int k = 0; k < (int)a->width; k++) {
     2.
     3.     // setup execution parameters, for (int i = k + 1; i < n; i++)
     4.     int threads = a->width - k;
     5.     int gridX = (threads + THREADS_PER_BLOCK-1) / THREADS_PER_BLOCK;
     6.
     7.     if (pivot) {
     8.         lud_simple_pivot<<< 1, 1 >>>( d_lu, lu->width, lu->height, k);
     9.     }
    10.
    11.     // Calculate scale factors for column k
    12.     lud_simple_calc_scale_factor<<< gridX, THREADS_PER_BLOCK >>>(
    13.         d_lu, lu->width, lu->height, k);
    14.     // Calculate new column values with scale factor
    15.     lud_simple_compute_row<<< gridX, THREADS_PER_BLOCK >>>(
    16.         d_lu, lu->width, lu->height, k);
    17. }

The function call in line 12 calculates the multipliers of a given column on the device. Line 15 performs the row operations with the multipliers. The kernels being called and their logic are shown here:
    __global__ void lud_simple_calc_scale_factor(float *lu, int luWidth, int luHeight, int k) {

        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight)
        {
            // Calculate rik scale factor and insert to lower triangle
            lu[i * luWidth + k] /= lu[k * luWidth + k];
        }
    }

    __global__ void lud_simple_compute_row(float *lu, int luWidth, int luHeight, int k) {

        // Id of the row
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int i = k + 1 + tid;

        if (i < luHeight) {

            // Load rik scale factor, can be cached in shared memory
            float rik = lu[i * luWidth + k];

            // Subtract row k elements from row i elements with the Rik scale factor
            for (int c = k + 1; c < luWidth; c++)
            {
                lu[i * luWidth + c] -= rik * lu[k * luWidth + c];
            }
        }
    }


7.2.2 Test and results
Tests were performed with the data sizes 400, 2000, 4000, 6000 and 10,000, where the data size is both the width and height of the matrix being decomposed. The element values are randomly generated, which can mean, although it is unlikely, that no LU-factorisation exists. Pivoting, meaning row interchanges, is applied both to ensure numerical stability and to ensure that an LU-factorisation does exist.

Multipliers and row operations
The test was performed on the four different platforms, as shown in the following graph.

[Figure: gigaflops (0.00 to 0.60) against data size (0 to 10,000) for platforms #1 to #4]

Figure 8 – Performance of simple LU-decomposition on different platforms.

The graph indicates that the peak performance measured in gigaflops is dependent on the data size. The performance increases for matrices of increasing sizes up to about 2000 × 2000, and after that the performance levels off for all platforms. The peak GPU performance of the fastest platform was about 0.5 gigaflops, which is slow compared to a peak CPU performance of 2.44 gigaflops.


Kernel invocation overhead
If pivoting is used, then for each step k, three kernels are invoked. For a 10,000 × 10,000 matrix this equals 30,000 kernel invocations. Naturally, kernel invocations will incur overhead, but how much will 30,000 invocations influence the total result? Vasily Volkov et al. have measured the kernel launch overhead for various systems and GPUs [20]. For synchronised kernel invocations the times measured were between 10-14 μs, for asynchronous kernel invocations the timings were 3-7 μs. The following table shows the kernel invocation overhead as a ratio of the fastest running times on system #3 and #4.

    System                           #3              #4
    Fastest result (n=10,000)        15,736.55 ms    32,784.64 ms
    Asynchronous (low=900 ms)        5.72%           2.75%
    Asynchronous (high=2,100 ms)     13.34%          6.41%
    Synchronous (low=3,000 ms)       19.06%          9.15%
    Synchronous (high=4,200 ms)      26.69%          12.81%

Table 9 - Kernel invocation overhead ratio of total running time

This table shows that kernel invocations should not be disregarded when implementing an algorithm, because their contribution to the total running time can be relatively high. This obviously depends on the algorithm, but for this LU-decomposition implementation, the contribution is as high as 26.69%. It is evident that asynchronous invocations are faster than synchronous ones, and hence represent a lower percentage of the total running time. So where possible, asynchronous functions should be used.

GPU.NET
The LU-decomposition implementation requires a matrix to initially be copied to device memory, after which several kernel calls compute the result and update the data, before the matrix is copied back to the host. GPU.NET does currently not allow data on the device to be modified by multiple kernel calls, so testing LU-decomposition through GPU.NET would not be relevant, as data transfers would severely impact performance.


7.3 Block LU-decomposition
One of the first optimisation strategies, suggested by the matrix-multiplication chapter, was to divide the problem into smaller pieces that fit into caching memory. Jack Dongarra et al. have developed a block LU algorithm, called the right looking algorithm, that automatically supports tiling. Please note that the term block in a block algorithm is not the same as the blocks that are part of a Cuda grid, and used to define thread granularity. A Cuda block will from here on be referred to as a thread block.

7.3.1 The block algorithm
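By partitioning an $n \times n$ matrix A, the factorisation $A = LU$ may be partitioned as shown [22]:

$$\begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix} = \begin{pmatrix} L_{00} & 0 \\ L_{10} & L_{11} \end{pmatrix} \begin{pmatrix} U_{00} & U_{01} \\ 0 & U_{11} \end{pmatrix}$$

The usual rules of matrix-multiplication hold for block matrices, so we can write:

1. $A_{00} = L_{00} U_{00}$
2. $A_{10} = L_{10} U_{00}$
3. $A_{01} = L_{00} U_{01}$
4. $A_{11} = L_{10} U_{01} + L_{11} U_{11}$

Where $b \times b$ is the block size, $A_{00}$ is $b \times b$, $A_{01}$ is $b \times (n - b)$, $A_{10}$ is $(n - b) \times b$ and $A_{11}$ is $(n - b) \times (n - b)$.

The first step is based on lemma 1 and 2: by performing a normal LU-decomposition on $A_{00}$ and $A_{10}$ combined, the result is then $L_{00}$, $L_{10}$ and $U_{00}$, which are then known.

Step 2 uses lemma 3 and a triangular solve method, which results in the matrix $U_{01}$.

In step 3, rearranging lemma 4 gives $A'_{11} = A_{11} - L_{10} U_{01} = L_{11} U_{11}$, which shows that $L_{11}$ and $U_{11}$ can be found by LU-decomposing $A'_{11}$. This can be achieved by using the above steps again on $A'_{11}$.

In $n / b$ steps the matrix A has been decomposed by using a block LU-decomposition, as depicted here. The white parts have already been solved.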


Figure 9 – Matrix A being decomposed by block LU-decomposition in steps.

Figure 9 shows that the height and width of the matrix is $n$, $k$ is the current step, and the block width is the dimension of the current block being processed (here the green sub-matrix). Step 1 solves the green and purple sub-matrices by regular LU-decomposition. As the second step, the lower triangular matrix of the block (L of the green block) is used to triangular solve the cyan sub-matrix. In the third step, the blue sub-matrix is found by regular matrix-multiplying the purple and cyan sub-matrices and subtracting the resulting values from the current elements in the blue sub-matrix. The steps are then repeated for the remaining parts until the whole matrix is processed.

As this algorithm makes it possible to partition large matrices and solve smaller parts, and therefore exploit shared memory, this algorithm will be implemented using Cuda and used for testing.
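The three steps can be sketched sequentially in C before moving to kernels. This is a minimal reference sketch under stated assumptions, not the Cuda implementation: it works in place on a row-major matrix, applies no pivoting (so it is only safe for matrices that do not need it, such as diagonally dominant ones), and the names block_lu, min_int and the block size parameter b are hypothetical. The colours in the comments refer to the description of Figure 9.

```c
static int min_int(int x, int y) { return x < y ? x : y; }

/* Right-looking block LU decomposition without pivoting, in place on a
 * row-major n x n matrix a, with block size b.
 * block_lu is a hypothetical name used for illustration. */
void block_lu(float *a, int n, int b)
{
    for (int k = 0; k < n; k += b) {
        int end = min_int(k + b, n);   /* one past the last block column */

        /* Step 1: regular LU on the green and purple sub-matrices,
         * restricted to columns k..end-1. */
        for (int j = k; j < end; j++) {
            for (int i = j + 1; i < n; i++) {
                float rij = (a[i * n + j] /= a[j * n + j]);
                for (int c = j + 1; c < end; c++)
                    a[i * n + c] -= rij * a[j * n + c];
            }
        }

        /* Step 2: triangular solve of the cyan sub-matrix with the unit
         * lower triangle of the green block, one column at a time. */
        for (int c = end; c < n; c++)
            for (int r = k; r < end; r++)
                for (int j = k; j < r; j++)
                    a[r * n + c] -= a[r * n + j] * a[j * n + c];

        /* Step 3: blue -= purple * cyan (matrix-multiplication update). */
        for (int i = end; i < n; i++) {
            for (int c = end; c < n; c++) {
                float sum = 0.0f;
                for (int j = k; j < end; j++)
                    sum += a[i * n + j] * a[j * n + c];
                a[i * n + c] -= sum;
            }
        }
    }
}
```

Step 3 touches by far the most elements, which is why mapping it to a tiled matrix-multiplication kernel is attractive; steps 2 and 3 are also independent per column and per element respectively.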

7.3.2 Implementation
This algorithm, as shown above, consists of three steps, which the implementation must also follow. The first step, the regular LU-decomposition of the current block column, is covered by an optimised kernel of the simple algorithm (lud_block_scale). This part also includes the optimised pivoting kernels (lud_block_pivot, lud_block_pivot_L2 and lud_block_swap).


The second step requires a triangular solving kernel (lud_block_triangular_solve), and the last step is a regular matrix-multiplication kernel (lud_block_matrixMultiplication), which has already been implemented and optimised in chapter 6 from page 29.

Pivoting
In LU-decomposition, pivoting is performed for each column. Instead of the simple algorithm that had a running time proportional to $n$, the parallel nature of Cuda can be exploited to implement a reduction pivoting algorithm with a running time of O(log(n)). This required two pivoting kernels; the first reduces the current column of the matrix and saves the result to a temporary pivoting array on the device. The second kernel does the same, but works on the temporary pivot array instead of the matrix. The first kernel is shown here, and has already been optimised with focus on memory coalescing and a sort of tiling strategy. In lines 20 and 21 the individual threads load a value from global memory to shared memory. This data is then processed from line 27 to 37, while threads synchronise data access for consistency. In line 42, the first thread of each thread block in the grid writes the pivoting index to the temporary array.
     1. __global__ void lud_block_pivot(int *out, float *a, int M, int k, int max)
     2. {
     3.     extern __shared__ float shared[];
     4.     float* max_cache = (float*)shared;
     5.     int* idx_cache = (int*)&shared[blockDim.x];
     6.
     7.     unsigned int tx = threadIdx.x;
     8.     unsigned int i = blockIdx.x * blockDim.x + tx + k; // Get row index
     9.
    10.     unsigned int idx = i * M;
    11.
    12.     // Clear cache for threads that exceed max, so that they do not
    13.     // influence the result
    14.     max_cache[tx] = 0;
    15.     idx_cache[tx] = -1;
    16.
    17.     if (i < M)
    18.     {
    19.         // Read value + set row index
    20.         max_cache[tx] = fabsf(a[idx + k]);
    21.         idx_cache[tx] = i;
    22.     }
    23.     // Sync all threads (outside the branch) so every value is loaded
    24.     __syncthreads();
    25.
    26.     // Do the actual pivot finding (blockDim.x must be a power of two)
    27.     for(unsigned int stride = blockDim.x/2; stride>0; stride>>=1)
    28.     {
    29.         if (tx < stride && (stride+tx+k) < M
    30.             && max_cache[tx] < max_cache[tx + stride])
    31.         {
    32.             max_cache[tx] = max_cache[tx + stride]; // Update value
    33.             idx_cache[tx] = idx_cache[tx + stride]; // Update index
    34.         }
    35.         // Sync threads
    36.         __syncthreads();
    37.     }
    38.
    39.     // The first thread should write result from block to output
    40.     if (tx == 0)
    41.     {
    42.         out[blockIdx.x] = idx_cache[0]; // Load index to output
    43.     }
    44. }
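The reduction pattern in the kernel can be followed more easily in a sequential emulation. The C sketch below performs the same stride-halving comparison on the host; it is an illustration under stated assumptions, not part of the thesis code: the function name reduce_argmax is hypothetical, count must be a power of two, val is expected to hold absolute values already, and both arrays are overwritten.

```c
/* Sequential emulation of the shared-memory tree reduction used to find
 * the pivot row. In every pass, "thread" tx compares slot tx with slot
 * tx + stride and keeps the larger value together with its row index.
 * Assumptions: count is a power of two; val holds absolute values;
 * val and idx are scratch arrays that are overwritten.
 * reduce_argmax is a hypothetical name used for illustration. */
int reduce_argmax(float *val, int *idx, int count)
{
    for (int stride = count / 2; stride > 0; stride >>= 1) {
        /* On the GPU all iterations of this inner loop run concurrently,
         * followed by a barrier before the next pass. */
        for (int tx = 0; tx < stride; tx++) {
            if (val[tx] < val[tx + stride]) {
                val[tx] = val[tx + stride];
                idx[tx] = idx[tx + stride];
            }
        }
    }
    return idx[0];   /* row index of the largest value */
}
```

The outer loop runs log2(count) times, which is where the logarithmic running time comes from when each pass is executed by parallel threads.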

Swapping rows
If a pivoting row has been identified, the indices of the two rows are then transferred to the device, by calling a kernel for swapping rows. By swapping the rows on the device, a transfer of the matrix to and from the host is avoided. Several threads can be used to swap the rows, and by aligning the memory access correctly, the memory access is coalesced.

LU-factorisation
The algorithm described by Stewart [17] is parallelised by making threads process individual rows. To optimise the performance, the $k$th row is loaded to shared memory and accessed by all threads, as the grid size is 1 × 1. This means that the $k$th row is only loaded to shared memory once, and the values are accessed by all threads. With just one block, the disadvantage is that Cuda is not able to hide memory access latency by switching to other active blocks. I will later test whether this approach performs well, or another approach with several thread blocks performs better.

Triangular solving
Triangular solving can be performed row- or column-wise. I have chosen column-wise, as the memory access of the threads in a block will be coalesced. This part is well suited for GPU processing, because each column can be processed independently.


     1. __global__ void lud_block_triangular_solve(float *a, int M, int k, int LU_BlockDim)
     2. {
     3.     extern __shared__ float y[];
     4.
     5.     int tx = threadIdx.x;
     6.     int tid = blockIdx.x * blockDim.x + tx;
     7.     int column = tid + k + LU_BlockDim;
     8.
     9.     if (column < M)
    10.     {
    11.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
    12.         {
    13.             float res = a[(r+k) * M + column];
    14.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
    15.                 res -= a[(r+k) * M + c + k] * y[tx * LU_BlockDim + c];
    16.             y[tx * LU_BlockDim + r] = res;
    17.         }
    18.
    19.         for (int r = 0; r < LU_BlockDim; r++)
    20.             a[(r+k) * M + column] = y[tx * LU_BlockDim + r];
    21.     }
    22. }

Each thread uses shared memory to calculate the resulting column values (line 3, 15, 16 and 20). The size of the thread blocks, and thereby the needed shared memory, is not known at compile time. But as shared memory can be dynamically allocated, this is not a problem. Shared memory is fast, but registers are even faster. If the thread block size was known at compile time, it could prove beneficial to use registers instead of shared memory.

Matrix-multiplication
The kernel for performing matrix-multiplication is based on tiling v3, which includes tiling, pre-fetching and memory coalescing optimisations. For any details about this kernel please turn to paragraph 6.3.4 on page 42.

7.3.3 Test and results
The initial block algorithm was tested on the four different platforms and with data sizes ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes were tested on the CPU.

The test with the largest matrix was not performed on platform #1, due to memory limitations, and neither was it run on the CPU, as the running time would be too high. For comparison I have added a qualified projection on the graph, which shows that platform #3 had a peak performance of 14.37 gigaflops for a 10,000 × 10,000 matrix.


[Figure: gigaflops (0.0 to 16.0) against data size (0 to 10,000) for platforms #1 to #4 and the CPU]

Figure 10 - Performance of block LU-decomposition v1 on different platforms.

The graph shows two important things. Firstly, the GPU architecture is in fact able to perform LU-decomposition faster than the CPU, for larger matrices. The specific speed of the platform determines for which data sizes the GPU is faster, but looking at the graph, this happens somewhere between 1,000 and 3,000. Secondly, the peak performance of the algorithm is almost proportional to the data size, for these tests. Obviously the peak performance cannot keep increasing proportionally to the data size; there must be an upper limit. But it makes sense that when the data size increases, even more operations can be performed in parallel.

Several of the tests were also performed by the CPU as a comparison. The average difference was 0.104211 and the maximum and minimum differences were respectively 0.332855 and 0.0. The 0.0 differences were only achieved on platform #4 with a compute capability of 2.1.

Profiling
The Nvidia Compute Visual Profiler is a tool that allows profiling of a Cuda program. The GPU time summary plot indicates which part of an algorithm could be optimised with most effect. The following figure shows how much computing time each kernel uses.


Figure 11 - Computing time of each kernel in block LU-decomposition v1 on platform #4.

Almost 90% of the time is spent in the regular LU-decomposing kernel, so optimising this part should have the best effect on the total running time.

7.3.4 Optimising round 1
Having determined that the kernel lud_block_scale has the best performance optimisation potential, I will now analyse the source code to identify performance limiting factors. The kernel is, for each $k$th iteration, called with one thread block and the LU block size in threads, which normally is 20. This means that only 20 threads are running in parallel at any given time, but as the warp size for the G80 and GT200 architecture is 32, the active warp is padded with 12 empty threads that do not process any data.
     1. __global__ void lud_block_scale(float *a, int M, int k)
     2. {
     3.     extern __shared__ float ac[];
     4.
     5.     int aWidth = M;
     6.     int tx = threadIdx.x;
     7.     int end = min( blockDim.x, M-k );
     8.
     9.     ac[tx] = a[k * aWidth + k + tx]; // Load k row to shared memory, as it is used across threads
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     for(int i = k+1 + tx; i < M; i+=blockDim.x) { // Foreach row
    15.
    16.         // Compute scale factor Rik, 1 operation=divide
    17.         float rik = (a[i * aWidth + k] /= ac[0]);
    18.
    19.         for (int c = 1; c < end; c++) // Foreach column value in row
    20.             a[i * aWidth + k + c] -= rik * ac[c];
    21.     }
    22. }

Another factor in this kernel is its dependency on global memory. All threads load a value from the $k$th row into shared memory in line 9. Both the read from global memory and the write to shared memory are coalesced, so this is good. The cached values are then used to calculate the upper and lower triangular matrices from line 14 to 20. These loops rely heavily on global memory access, and have only few operations to hide latency.

CGMA
The core parts of the kernel are line 17 and 20. Consider line 17: a memory load and a write, combined with a single divide operation, gives a CGMA of 0.5. Line 20 has a memory load and a write, together with the operations addition and multiply, which gives a CGMA of 1.0. On platform #1 the global memory has a peak bandwidth of 16.6 GB/sec. The expected giga single-precision data per second is 4.15 (16.6 / 4). Line 20 is the dominant part, but line 17 cannot be ignored, so taking the CGMA ratios of 1.0 and 0.5 into account, this kernel will execute at no more than between 2.075 and 4.15 gigaflops on platform #1 [4].

Hiding latency
The low CGMA suggests that this kernel is limited by memory, but in the current form, it is possible to improve on latency hiding. One way is to assign more work to the streaming processors and let them continue working on another warp, while the first warp waits for data. The updated thread block size is 64, and the needed number of blocks would be (M − k) / 64, where M is the height of the matrix and k the current iteration. The kernel looks like this:
     1. __global__ void lud_block_scale_v2(float *a, int M, int k, int end)
     2. {
     3.     extern __shared__ float ac[];
     4.
     5.     int aWidth = M;
     6.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
     7.
     8.     // Load k row to shared memory, as it is used across threads
     9.     ac[threadIdx.x] = a[k * aWidth + k + threadIdx.x];
    10.
    11.     // Sync threads to make sure all other also have loaded values
    12.     __syncthreads();
    13.
    14.     int i = k+1 + tid; // Row index
    15.     if (i < M)
    16.     {
    17.         // Compute scale factor Rik, 1 operation=divide
    18.         float rik = (a[i * aWidth + k] /= ac[0]);
    19.
    20.         for (int c = 1; c < end-k; c++) // Foreach column value in row
    21.             a[i * aWidth + k + c] -= rik * ac[c];
    22.     }
    23. }

But this is not the only benefit from this update. A for loop is a control flow element that is often part of a kernel. When doing operation counting analysis of a kernel, the operations contributed by for loops are often overlooked. Consider line 20 in the kernel above: for every iteration the c++ operation and the c < end-k comparison are performed. Unrolling or removing loops is another way of increasing the performance of a kernel, and this is what has been achieved with this kernel, where the row loop in line 14 of the former version has been replaced by one thread per row.

7.3.5 Test and results
This updated block algorithm was tested again, on the four different platforms and with data sizes ranging from 400 × 400 to 10,000 × 10,000 matrices. For comparison the same data sizes were also tested on the CPU.

[Figure: gigaflops (0 to 35) against data size (0 to 10,000) for platforms #1 to #4 and the CPU]

Figure 12 - Performance of block LU-decomposition v2 on different platforms.


This optimised algorithm is faster than the former: the peak performance of platform #3 is now 31.51 gigaflops, which is 2.19 times as fast. This also shows that even smaller matrices can be processed by the GPU architecture with benefit.

Several of the tests were also performed by the CPU as a comparison. The average difference was 0.103360 and the maximum and minimum differences were respectively 0.314285 and 0.002296. The Cuda architectures with higher compute capability do not seem to have smaller deviations from the CPU reference result.

Profiling
Focus should be on hiding latency, which can be achieved by increasing the number of active blocks or by data pre-fetching. The following graph shows the updated computing time, when the number of threads and blocks has been adjusted.

Figure 13 - Computing time of each kernel in block LU-decomposition v2 on platform #4.

This profiling result indicates that optimisation of the lud_block_scale and the lud_block_matrixMultiplication kernels could have the highest performance effect. So this is what I will look into as the next step.

7.3.6 Optimising round 2
The matrix-multiplication kernel used has already been well optimised, as it is based on the work and results from the matrix-multiplication analysis in chapter 6 (from page 29). But for simplicity reasons, this kernel was not optimised with Volkov's suggestion of several outputs per thread. This will be the next step to test. The details of how this was implemented have already been well described, so please refer to the chapter about matrix-multiplication for any details about this optimisation.


The result of the optimised algorithm is shown below. The peak performance for platform #3 is now 42.89 gigaflops, which is an increase of 1.32 times.

[Figure: gigaflops (0 to 45) for the data sizes 400, 2,000, 4,000, 6,000 and 10,000 on platforms #1 to #4 and the CPU]

Figure 14 - Performance of block LU-decomposition v3 on different platforms.

7.3.7 Further optimisation
The algorithm consists of 6 kernels covering pivoting and the three steps in block LU-decomposition. Pivoting, regular LU-decomposition and matrix-multiplication have been optimised; the only kernel left in its original version is lud_block_triangular_solve, which will be one of two parts for an optimisation attempt. The other part is the optimised kernel lud_block_scale, which still accounts for the highest GPU time consumption.

Triangular solve
The first version of the triangular solve implementation was well balanced with regard to grid and thread block size, and even though, according to the profiling result above, an optimisation would only affect about 3% of the GPU running time, I have chosen to try to optimise this kernel further. Although any optimisation would have a limited impact on peak performance, this would still be a good exercise and give insight into the analysis and improvement of a kernel.


The focus was on unrolling a loop (line 20 was in the former version performed in a separate loop) and coalescing memory access (lines 17 and 19 are now coalesced), and as expected the optimisation did not yield any significant change in peak performance.
     1. __global__ void lud_block_triangular_solve_v2(float *a, int M, int k, int LU_BlockDim)
     2. {
     3.     extern __shared__ float y[];
     4.
     5.     int tid = blockIdx.x * blockDim.x + threadIdx.x;
     6.     int column = tid + k + LU_BlockDim;
     7.
     8.     if (column < M)
     9.     {
    10.         for (int r = 0; r < LU_BlockDim; r++) // For each row in block
    11.         {
    12.             int rkM = (r+k) * M; // Start of row r+k, as in the first version
    13.             float res = a[rkM + column];
    14.
    15.             for (int c = 0; c < r; c++) // 0<=c<r, so below diagonal
    16.                 res -= a[rkM + c + k] *
    17.                        y[c * LU_BlockDim + threadIdx.x]; // requires blockDim.x <= LU_BlockDim
    18.
    19.             y[r * LU_BlockDim + threadIdx.x] = res;
    20.             a[rkM + column] = res;
    21.         }
    22.     }
    23. }

The result was limited as expected, but if I were to improve this kernel further, I would focus on testing whether Volkov's suggestion (1 thread = 2 outputs) would have any positive effect. Another improvement would focus on the global memory access in lines 13 and 16. The elements of the lower triangular matrix of the current block being processed (L of the orange sub-matrix) could be copied to shared memory, so that the reads in lines 13 and 16 could be served from the faster shared memory instead of global memory.

Figure 15 – Showing the sub-matrix part of the triangular solve method.


Regular LU-decomposition
Focusing on the kernel that accounts for the highest GPU time consumption makes sense if this part of the algorithm can be further optimised. The former version of the kernel focused on hiding latency by increasing the number of thread blocks. Hiding latency can also be achieved by using data prefetching and by minimising the need for global memory access. An attempt was made to improve the performance of the kernel lud_block_scale by using registers to hold indices that are computed several times, by using data prefetching and by applying the Volkov suggestion (1 thread = 2 outputs). Unfortunately no performance gains were achieved compared to the former versions; in fact, a minor performance loss was the result.

[Figure: two charts of gigaflops per kernel edition (v4, v5, v6); platform #3 on an axis from 42.0 to 42.4, platform #4 on an axis from about 20.2 to 20.4]

Figure 16 – A 10,000 × 10,000 matrix LU-decomposed on platform #3 and #4.

These results show that applying an optimisation method can sometimes lead to a performance loss. The reason is that these methods add restrictions to the kernel, which require extra boundary checks and thereby increase control flow complexity and the number of operations. Improvements should therefore be implemented with care and always be followed by testing, to determine whether they actually yield better performance.

Correctness
Several tests with different data sizes were performed, using results computed by the CPU as a reference for the GPU computed results. The average deviation was 0.062963 and the maximum and minimum were 0.309113 and 0.0 respectively. The 0.0 was only achieved by platform #4 with a Cuda compute capability of 2.1. The deviations increased proportionally to the data size, which makes good sense: any inaccuracy affects the result of every subsequent iteration, and the larger the matrix, the more iterations are needed.
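The deviation figures above can be gathered with a small host-side comparison once the GPU result has been copied back. The following is a minimal sketch only: the names `Deviation` and `compare_results` are hypothetical, and it assumes that the deviation is the element-wise absolute difference between the two results, which the text does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of element-wise absolute differences between two results.
struct Deviation {
    float average;
    float maximum;
    float minimum;
};

// Hypothetical helper: compares a CPU reference with a GPU result that has
// already been copied back to host memory.
Deviation compare_results(const std::vector<float>& cpu,
                          const std::vector<float>& gpu)
{
    assert(cpu.size() == gpu.size() && !cpu.empty());
    double sum = 0.0; // accumulate in double to limit rounding in the average
    float max_dev = 0.0f;
    float min_dev = std::fabs(cpu[0] - gpu[0]);
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        const float dev = std::fabs(cpu[i] - gpu[i]);
        sum += dev;
        max_dev = std::max(max_dev, dev);
        min_dev = std::min(min_dev, dev);
    }
    return { static_cast<float>(sum / cpu.size()), max_dev, min_dev };
}
```

Accumulating the sum in double precision keeps the comparison itself from adding noticeable rounding error to the average.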

7.3.8 Large matrices
The test results have so far indicated that the peak performance of the block LU-decomposition was proportional to the data size, but there must be a maximum where the architecture limits performance. My curiosity drove me to find this limit, so I updated the program to support very large matrices with sizes from 400 × 400 and up to 20,000 × 20,000. A matrix of this size requires about 1525 MB of both host and device memory, which only platform #3 matches with 4GB.
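The memory requirement quoted above follows directly from the matrix dimensions and the size of a single-precision value. A minimal sketch of the arithmetic, with hypothetical function names and assuming 1 MB = 1024 × 1024 bytes:

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store an n x n matrix of single-precision floats.
std::uint64_t matrix_bytes(std::uint64_t n)
{
    assert(n > 0);
    return n * n * sizeof(float);
}

// The same size expressed in MB, using 1 MB = 1024 * 1024 bytes.
double matrix_megabytes(std::uint64_t n)
{
    return static_cast<double>(matrix_bytes(n)) / (1024.0 * 1024.0);
}
```

For n = 20,000 this gives 1,600,000,000 bytes, or roughly 1525.9 MB, which matches the figure in the text.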

[Figure 17 plot: peak performance in gigaflops (0-60) against data size (0-35,000) for platform #3]

Figure 17 - Peak performance of LU-decomposition v3 on platform #3

The graph above shows the peak performance in gigaflops for different data sizes. From the results I have added a qualified projection for matrices up to 35,000 × 35,000 in size. The results and projections show that the v3 algorithm should have a peak performance of about 51-52 gigaflops on system #3.

It is difficult to calculate the theoretical performance of the LU-decomposition block implementation, because the computations are divided into 6 different kernels. Each kernel has its own CGMA and its share in solving the full problem, but I will try to approximate. The kernels with a CGMA between 0.5 and 1.0 take up 48% of the running time, and the matrix-multiplication kernel has a CGMA of 20 and makes up about 17%. The remaining kernels have a CGMA of about 1.0. These numbers have been retrieved from the Nvidia profiler shown in Figure 13, and the approximated range is found by these two equations:

$$\mathrm{CGMA}_{low} = 0.5 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 3.99 \approx 4.0$$

$$\mathrm{CGMA}_{high} = 1.0 \cdot 48\% + 20.0 \cdot 17\% + 1.0 \cdot 35\% = 4.23$$

So the approximated CGMA is 4.0. The peak global memory performance for platform #3 is 102.4 GB/sec, which gives 25.6 giga single-precision values per second (102.4 / 4). The theoretical peak performance of this algorithm on this platform is therefore 102.4 gigaflops (25.6 × 4.0) for fully coalesced memory access. The actual performance is about half, namely 52 gigaflops.

There are several factors influencing this result. One is the fact that not all memory loads and writes are coalesced; another is the extra instructions processed due to control flow complexity. But the result indicates that memory access is not the limiting factor on performance for these kernels, and that something else is.
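The estimate above is a time-weighted average of the CGMA ratios, multiplied by the number of single-precision values that global memory can deliver per second. The sketch below reproduces that arithmetic; the type and function names are hypothetical, and the time shares are the ones quoted from the profiler in the text.

```cpp
#include <cassert>
#include <vector>

// One group of kernels: its CGMA ratio and its share of the running time.
struct KernelGroup {
    double cgma;
    double time_share; // fraction of the total running time, 0..1
};

// Time-weighted average CGMA over all kernel groups.
double weighted_cgma(const std::vector<KernelGroup>& groups)
{
    double total = 0.0;
    for (const KernelGroup& g : groups) {
        assert(g.time_share >= 0.0 && g.time_share <= 1.0);
        total += g.cgma * g.time_share;
    }
    return total;
}

// Bandwidth-bound estimate: floats per second from global memory times CGMA.
double bandwidth_bound_gflops(double bandwidth_gb_per_s, double cgma)
{
    const double gfloats_per_s = bandwidth_gb_per_s / sizeof(float);
    return gfloats_per_s * cgma;
}
```

With the shares 48%, 17% and 35% and a CGMA of 0.5 for the first group, `weighted_cgma` gives about 3.99, and `bandwidth_bound_gflops(102.4, 4.0)` gives 102.4 gigaflops.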

7.4 Evaluation
The LU-decomposition algorithms have given valuable insight into which optimisation methods work and which do not. Reducing the number of kernel invocations, or using asynchronous functions, can reduce the total running time. With 30,000 kernel calls, the total invocation time could be reduced from 3-4.2 seconds to 0.9-2.1 seconds. I do not think that reducing kernel calls should be a primary focus, but it is something that a developer should be aware of when implementing an algorithm for Cuda. Basing the implementation on a block algorithm increased the performance for two reasons. First, the problem size is reduced to pieces that can exploit faster memory types, and second, the matrix-multiplication operation is highly parallel and had already been optimised for the Cuda architecture. These tests also showed that when a kernel is invoked, using several thread blocks is better than using just one. One reason could be that several thread blocks can be distributed across multiple SMs, whereas a single block runs on one SM only. Unrolling a loop together with Volkov’s suggestion for matrix-multiplication also helped increase performance. The last part is a bit surprising, because Volkov’s suggestion for matrix-multiplication actually led to a performance decrease on system #3.


Other tests revealed that data prefetching and the tiling strategy did not actually increase performance, but left it without any major change. This supports the notion described above, that memory is not the limiting factor. The correctness tests also confirmed that instructions have a higher degree of precision on CC v2.0 than on earlier versions. If I were to optimise LU-decomposition even further, I would focus on arithmetic optimisations, such as unrolling loops, minimising control flow complexity and removing unnecessary synchronisation points.


8 QR decomposition
QR-decomposition, also known as QR-factorisation, is a decomposition of the $m \times n$ matrix A, with $m \ge n$, in the form:

$$A = QR$$

Where Q is an $m \times m$ orthogonal matrix and R is an $m \times n$ upper triangular matrix. An orthogonal matrix satisfies

$$Q^T Q = I$$

Which implies $Q^{-1} = Q^T$. There are different methods for calculating the QR factorisation, which can be used to solve linear systems and least squares problems [23][24].

8.1 Analysis
The different methods for decomposing matrix A into a QR factorisation include Gram-Schmidt, Householder reflections and Givens rotations. The classic Gram-Schmidt process is considered to be subject to numerical instability. The modified Gram-Schmidt algorithm overcomes this numerical instability, but at the expense of adding extra operations [23][25]. For these reasons I will consider neither the classic nor the modified version. Operation count analysis of both Householder reflections and Givens rotations shows that Givens rotations require about 50% more operations than Householder transformations [26]. Besides that, Givens rotations rely heavily on sine and cosine instructions, which will be processed by the limited SFU. I have therefore decided to base the QR-decomposition algorithm on Householder transformations. There are also other advantages. Firstly, the parallelisation is similar to LU-decomposition, so I expect to draw on the parallel optimisation experiences from the LU-decomposition chapter [24]. Secondly, Householder QR can use a compressed data storage form, by using the original matrix A and an additional array for the diagonal values of R [23]. Consider the matrix A in Figure 18: the nonzero part of the Householder vectors is stored in A along with the upper triangular matrix R.

Figure 18 - Storage strategy for the compressed Householder QR-factorisation

The diagonal of R is stored in an extra vector. If the actual $Q$ or $Q^T$ is ever needed, it can be computed from this compressed representation [25].

8.1.1 The sequential algorithm
The sequential algorithm consists of 6 loops that overwrite the existing matrix with the Householder vectors and the upper triangular matrix R. The diagonal elements of R are stored in the array d. The elements of the matrix are stored vectorised in the array qr. Pivoting can be used to ensure numerical stability, but has been left out for simplicity reasons.
1.  // Core algorithm for QR Decomposition (Householder transformation)
2.  for (unsigned int k = 0; k < n; k++) // For each column
3.  {
4.      // Compute 2-norm of k-th column
5.      float sum = 0.0;
6.      for (int r = k; r < m; r++)
7.          sum += qr[r * n + k] * qr[r * n + k];
8.
9.      float nrm = sqrtf(sum);
10.
11.     if (nrm != 0.0)
12.     {
13.         // Compute the kth Householder vector.
14.         if (qr[k * n + k] < 0)
15.         {
16.             nrm = -nrm;
17.         }
18.         for (int i = k; i < m; i++)
19.         {
20.             qr[i * n + k] /= nrm;
21.         }
22.         qr[k * n + k] += 1.0;
23.
24.         // Apply transformation to remaining columns.
25.         for (int j = k + 1; j < n; j++)
26.         {
27.             float s = 0.0;
28.             for (int i = k; i < m; i++)
29.             {
30.                 s += qr[i * n + k] * qr[i * n + j];
31.             }
32.             s = (-s) / qr[k * n + k];
33.             for (int i = k; i < m; i++)
34.             {
35.                 qr[i * n + j] += s * qr[i * n + k];
36.             }
37.         }
38.     }
39.     d[k] = -nrm;
40. }
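To try the listing on a small matrix, it can be wrapped in a function. The sketch below keeps the same loops and the same row-major layout; the function name `householder_qr` and the use of `std::vector` are assumptions made for illustration and are not taken from the project source.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Householder QR of a row-major m x n matrix (m >= n), following the
// sequential listing. On return, qr holds the Householder vectors on and
// below the diagonal and R above it; d holds the diagonal of R.
void householder_qr(std::vector<float>& qr, std::vector<float>& d, int m, int n)
{
    assert(m >= n && static_cast<int>(qr.size()) == m * n);
    d.assign(n, 0.0f);
    for (int k = 0; k < n; k++) {
        // 2-norm of the k-th column, from the diagonal down.
        float sum = 0.0f;
        for (int r = k; r < m; r++)
            sum += qr[r * n + k] * qr[r * n + k];
        float nrm = std::sqrt(sum);

        if (nrm != 0.0f) {
            // Give nrm the sign of the diagonal element.
            if (qr[k * n + k] < 0)
                nrm = -nrm;
            for (int i = k; i < m; i++)
                qr[i * n + k] /= nrm;
            qr[k * n + k] += 1.0f;

            // Apply the reflection to the remaining columns.
            for (int j = k + 1; j < n; j++) {
                float s = 0.0f;
                for (int i = k; i < m; i++)
                    s += qr[i * n + k] * qr[i * n + j];
                s = -s / qr[k * n + k];
                for (int i = k; i < m; i++)
                    qr[i * n + j] += s * qr[i * n + k];
            }
        }
        d[k] = -nrm;
    }
}
```

For the 3 × 2 matrix with columns (1, 3, 5) and (2, 4, 6), the first diagonal value of R has magnitude √35, the norm of the first column, which gives a quick sanity check of the result.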

This implementation is based on the Householder QR factorisation algorithm, which in the central part has the following operation count per iteration [26]:

Dot products (Lines 7 and 30): $2(m-k)(n-k)$
Outer product (Lines 14-22): $(m-k)(n-k)$
Subtraction (Line 35): $(m-k)(n-k)$

Including the outer loop, the total running time is:

$$\sum_{k=1}^{n} 4(m-k)(n-k) \sim 2mn^2 - 2n^3/3$$

This shows that the running time is $O(n^3)$ for a square matrix, and that the running time increases faster than the data size.
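The operation count above can be turned into a gigaflops figure once a running time has been measured. This is a minimal sketch with hypothetical function names; whether the gigaflops numbers reported later in this chapter were derived from exactly this formula is an assumption.

```cpp
#include <cassert>

// Approximate floating-point operation count of Householder QR on an
// m x n matrix (m >= n): 2*m*n^2 - 2*n^3/3.
double householder_qr_flops(double m, double n)
{
    assert(m >= n && n > 0.0);
    return 2.0 * m * n * n - 2.0 * n * n * n / 3.0;
}

// Converts an operation count and a measured running time into gigaflops.
double gigaflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}
```

For a square matrix the count reduces to 4n³/3, so doubling the matrix dimension multiplies the work by eight while the data size only grows by a factor of four.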

Pivoting can be used to increase numerical stability, but for simplicity, this has not been included in this implementation.

8.1.2 Parallelism
QR-decomposition shares the structure of the outer loop with LU-decomposition: the order of the outer loop is important and requires sequential processing. The algorithm can be divided into the following tasks:

// Tasks in the algorithm for QR Decomposition
for (int k = 0; k < n; k++) // For each column
{
    // Task 1: Compute 2-norm of k-th column

    // Task 2: Compute the kth Householder vector.

    // Task 3: Apply transformation to remaining columns.
}

Each of the tasks above can be performed with varying parallel degree. So there is a parallel potential for this algorithm, and the details will be described later.

8.2 Simple algorithm
A simple version of QR-decomposition was implemented and tested. The goal was to port the algorithm to the Cuda architecture as quickly as possible, and to look at how to increase performance later, once the implementation was functional.

8.2.1 The algorithm
The sequential algorithm shown in paragraph 8.1.1 was implemented using regular C++, to target the CPU architecture. This implementation was used in the correctness test. The GPU accelerated simple version was implemented based on the analysis from paragraph 8.1.2, using the three identified tasks. To make the implementation of the three tasks as easy as possible, the same procedure is used for all tasks, namely that each task is handled by a single thread block that holds 128 threads. The drawback is that when the problem size becomes smaller than the number of threads, which happens when $m - k < 128$ for task 1 and 2, a number of empty threads are generated. This will only have an effect when the last rows and columns are being processed, and for large matrices this will constitute a relatively small amount of the total running time.

Task 1 - Two-norm
The data size for each step is $d = m - k$, which is the number of remaining rows. In the sequential implementation, the running time is proportional to $d$. In this version, one thread block with 128 threads processes any size, meaning the running time is proportional to $d/128$.

Task 2 - Householder vector
Each element in the Householder vector can be calculated independently, and this task processes the remaining rows. The data size for each step is $d = m - k$. The sequential implementation has a running time proportional to $d$, while this version, like task 1, has a running time proportional to $d/128$.

Task 3 – Transform columns
When task 1 and 2 have been performed, the rest of the matrix must be updated, which is the remaining columns and rows. Each column can be generated independently, and each of the 128 threads processes $(n-k)/128$ columns for each step.
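The single-block scheme for task 1 can be illustrated with a host-side emulation, in which an ordinary loop stands in for the 128 threads. This is a sketch under assumptions, not the project's kernel: the names are hypothetical, and on the GPU the outer loop over `tid` would be executed in parallel by the threads of the block.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

const int kThreads = 128; // threads in the single thread block

// Host-side emulation of task 1: each of the 128 "threads" sums the squares
// of every 128th element of column k, starting at row k + tid. The partial
// sums are then combined sequentially.
float column_norm_single_block(const std::vector<float>& qr,
                               int m, int n, int k)
{
    assert(k >= 0 && k < n && static_cast<int>(qr.size()) == m * n);
    std::vector<float> partial(kThreads, 0.0f);
    for (int tid = 0; tid < kThreads; ++tid) {      // parallel on the GPU
        for (int r = k + tid; r < m; r += kThreads) // about d/128 iterations
            partial[tid] += qr[r * n + k] * qr[r * n + k];
    }
    float sum = 0.0f;
    for (int tid = 0; tid < kThreads; ++tid)
        sum += partial[tid];
    return std::sqrt(sum);
}

// Loop iterations per thread for a step with d = m - k remaining rows.
int iterations_per_thread(int d)
{
    return (d + kThreads - 1) / kThreads;
}
```

When fewer than 128 rows remain, the inner loop does not execute at all for the higher thread indices, which corresponds to the empty threads mentioned in the text.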

8.2.2 Test and results
Tests were performed with data sizes 400, 2000, 4000 and 6000, data sizes that are both height and width of the matrix being decomposed. Matrix elements are randomly generated.

[Figure: performance in gigaflops (0.00-0.70) against data size (0-6000) for System #1, System #2, System #3, System #4 and the CPU]

All systems were used in the tests and the CPU was used to calculate a reference result. The CPU results indicate that the CPU gets slower when the matrix gets bigger. This makes good sense, as the CPU is not able to exploit its caches for large problem sizes. The performance of the GPU improves as the matrix size increases up to about 2000, after which it levels out. System #3 is the only system that performs better than the CPU, so this specific implementation does not yield an acceptable performance on the GPU. The maximum difference from the CPU reference result was 0.000049, so the implementation is considered acceptably accurate.


GPU.NET
QR-decomposition is not tested with GPU.NET for the same reasons as described in the LU-decomposition chapter.

8.3 Optimisation
Task 1 - Two-norm
This task can be improved by using a parallel reduction algorithm. Doing so makes it possible to decrease the asymptotic running time to $O(\log d)$.

Task 2 - Householder vector
Each step has a data size of $d = m - k$, which the running time of the sequential implementation is proportional to. Each element in the Householder vector can be calculated independently, making the asymptotic parallel running time about $d/p$, where $p$ is the number of processors. The maximum possible number of processes is $d$, making the asymptotic parallel running time $O(1)$ if a minimum of $d - 1$ processes is available; if not, the asymptotic running time is $O(d/p)$.

Task 3 – Transform columns
The number of operations required to update the remaining columns and rows of the matrix for each step is proportional to:

$$2 \cdot (m-k) \cdot (n-k)$$

Where $n-k$ represents the number of columns that can be processed in parallel. So the parallel running time is:

$$2 \cdot (m-k) \cdot (n-k) / p$$

Where $p$ is the number of processors. If a minimum of $n-k$ processors are available, the running time is $2 \cdot (m-k)$, where $m-k$ is the number of rows.
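The logarithmic running time of a parallel reduction, as suggested for task 1, comes from combining elements pairwise in a tree. The sketch below emulates this on the host and counts the steps; the function name is hypothetical and the code is an illustration of the technique, not the kernel used in the project. For the two-norm, the input values would be the squared column elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a tree-based parallel reduction. In every step,
// element i is combined with element i + half; on a GPU all additions of
// one step would run in parallel, so d values need about log2(d) steps.
// Returns the sum and reports the number of steps taken.
float tree_reduce_sum(std::vector<float> values, int* steps)
{
    assert(!values.empty() && steps != nullptr);
    *steps = 0;
    std::size_t active = values.size();
    while (active > 1) {
        const std::size_t half = (active + 1) / 2;      // handles odd counts
        for (std::size_t i = 0; i + half < active; ++i) // parallel on the GPU
            values[i] += values[i + half];
        active = half;
        ++(*steps);
    }
    return values[0];
}
```

Eight values are reduced in three steps, compared with seven sequential additions, which is where the gain over the single-block version of task 1 would come from.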

8.3.1 Test and results

With these optimisations implemented, the tests were performed again for 400 × 400, 2000 × 2000, 4000 × 4000 and 6000 × 6000 matrices.


[Figure: performance in gigaflops (0.00-1.60) against data size (0-6000) for System #1, System #2, System #3, System #4 and the CPU]

The optimisations have increased performance on all systems. Matrices larger than approximately 1000 × 1000 now benefit from being computed using the Cuda architecture. The peak performance for system #3 reached 1.42 gigaflops, not as impressive as the results achieved by matrix-multiplication and LU-decomposition, but still about 5 times as fast as the CPU. One of the reasons for this low performance is that the algorithm is not that well suited for a parallel architecture.

8.4 Block QR-decomposition
The algorithm used so far relies heavily on vector operations. Matrix operations are much better at exploiting a parallel architecture, and such an algorithm has been designed by Susan Ostrouchov et al. [27].

8.4.1 The block algorithm
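By partitioning an $m \times n$ matrix A, the factorisation QR may be partitioned as shown [27]:

$$A = \begin{pmatrix} A_1 & A_2 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$$

Where $b$ is the block size, $A_1$ is an $m \times b$ matrix containing the first $b$ columns, and $A_2$ is an $m \times (n-b)$ matrix containing the remaining columns. $A_{11}$ is $b \times b$, $A_{12}$ is $b \times (n-b)$, $A_{21}$ is $(m-b) \times b$ and $A_{22}$ is $(m-b) \times (n-b)$.

The first step is to perform a regular QR factorisation on $A_1$ using a series of Householder transformations of the form:

$$H_i = I - \tau_i v_i v_i^T$$

Where $i = 1, \dots, b$. The vector $v_i$ is of length $m$, where the first $i-1$ elements are 0 and the $i$-th element is 1, and

$$\tau_i = 2 / (v_i^T v_i)$$

It can be shown that:

$$Q_1 = H_1 H_2 \dots H_b = I - V T V^T$$

Where the $i$-th column of $V$ is $v_i$ and $T$ is a $b \times b$ upper triangular matrix. So in step 2, the triangular factor $T$ of the block reflector is calculated. The triangular factor is used together with the transformation above, to update the remaining matrix. After $n/b$ steps the matrix A has been decomposed by using a block QR-decomposition, as depicted here. The white parts have already been solved.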

Figure 19 - Matrix A being decomposed by block QR-decomposition in steps.


Figure 19 shows the M x M matrix together with the current step and the block width, which is the dimension of the current block being processed (here the green and purple sub-matrix). Step 1 QR-decomposes the green and purple sub-matrix by regular QR-decomposition, then the triangular factor is computed and used to transform the remaining columns in the cyan and blue sub-matrix. The steps are then continued for the remaining parts until the whole M x M matrix is processed. This algorithm makes it possible to partition large matrices and solve smaller parts. Furthermore, it uses matrix operations, which can utilise the parallel Cuda architecture.
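The triangular factor of a block reflector can be built one column at a time from the Householder vectors and their scalars. The sketch below follows the standard recurrence for the forward, column-wise case, which serves the same purpose as LAPACK's DLARFT routine. It is an illustration under assumptions, not the project's implementation: the names `Matrix` and `block_reflector_t` are hypothetical, and it uses the convention that the product of the reflectors $I - \tau_i v_i v_i^T$ equals $I - V T V^T$, where the vectors $v_i$ are the columns of $V$.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // row-major, [row][col]

// Builds the b x b upper triangular factor T such that
// H_1 H_2 ... H_b = I - V T V^T, where H_i = I - tau[i] v_i v_i^T and v_i is
// column i of the m x b matrix V. Column i of T is obtained from the
// columns before it.
Matrix block_reflector_t(const Matrix& V, const std::vector<double>& tau)
{
    const int m = static_cast<int>(V.size());
    const int b = static_cast<int>(tau.size());
    assert(m > 0 && static_cast<int>(V[0].size()) == b);
    Matrix T(b, std::vector<double>(b, 0.0));
    for (int i = 0; i < b; ++i) {
        // z = V(:, 0..i-1)^T * v_i
        std::vector<double> z(i, 0.0);
        for (int j = 0; j < i; ++j)
            for (int r = 0; r < m; ++r)
                z[j] += V[r][j] * V[r][i];
        // T(0..i-1, i) = -tau_i * T(0..i-1, 0..i-1) * z
        for (int j = 0; j < i; ++j) {
            double acc = 0.0;
            for (int c = j; c < i; ++c) // T is upper triangular
                acc += T[j][c] * z[c];
            T[j][i] = -tau[i] * acc;
        }
        T[i][i] = tau[i];
    }
    return T;
}
```

For the two vectors (1, 1, 0) and (0, 1, 1), both with a scalar of 1, the function returns T with rows (1, -1) and (0, 1), which can be confirmed by multiplying out the two reflectors by hand.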

8.4.2 Implementation
Implementation of this block algorithm has been challenging. The structure of the algorithm resembles the block LU-decomposition, which was implemented and performed well. This made me hope that the block QR algorithm could also be implemented and perform well. Unfortunately this has not been the case. Implementation of the algorithm was initially attempted for CPU processing. Thorough debugging and testing have revealed that most of the algorithm works and generates the expected result. Regrettably, not all parts work as hoped. The problematic part is related to this transformation, which applies the transposed block reflector $Q_1 = H_1 H_2 \dots H_b = I - V T V^T$ to the remaining columns:

$$\tilde{A}_2 = Q_1^T A_2 = (I - V T^T V^T) A_2$$

This transformation can be divided into three steps. Using explanation and figures from above:

$$W \leftarrow V^T A_2$$

This transformation is a matrix-multiplication between the transposed block of Householder vectors and the sub-matrix $A_2$, consisting of the remaining columns in the matrix A. The result is then written to the $b \times (n-b)$ matrix $W$.

Then the triangular factor $T$ of the block reflector should be computed, after which its transpose should be matrix-multiplied with the existing matrix $W$:

$$W \leftarrow T^T W$$

When the final matrix W is computed, it is used in a matrix-multiplication with the block of Householder vectors. The elements are then subtracted from the sub-matrix $A_2$, which should give $\tilde{A}_2$:

$$\tilde{A}_2 = A_2 - V W$$

These steps should then be repeated until the complete matrix has been decomposed. But numerous attempts at calculating the triangular factor $T$ have failed, and the resulting matrix and array contain wrong values.
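The update of the remaining columns can be written as three plain matrix products: first $W = V^T A_2$, then $W = T^T W$, and finally $A_2 = A_2 - V W$. The following sketch shows these steps for small dense matrices. It assumes that $V$ and $T$ are already available and correct; the names `Matrix` and `apply_block_reflector` are hypothetical, and the code is not the project's implementation.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // row-major, [row][col]

// Updates A2 in place with A2 <- (I - V T^T V^T) A2 in three steps.
// V is m x b, T is b x b upper triangular, A2 is m x c.
void apply_block_reflector(const Matrix& V, const Matrix& T, Matrix& A2)
{
    const int m = static_cast<int>(V.size());
    const int b = static_cast<int>(T.size());
    assert(m > 0 && static_cast<int>(A2.size()) == m);
    const int c = static_cast<int>(A2[0].size());

    // Step 1: W <- V^T * A2 (b x c)
    Matrix W(b, std::vector<double>(c, 0.0));
    for (int i = 0; i < b; ++i)
        for (int j = 0; j < c; ++j)
            for (int r = 0; r < m; ++r)
                W[i][j] += V[r][i] * A2[r][j];

    // Step 2: W <- T^T * W
    Matrix TW(b, std::vector<double>(c, 0.0));
    for (int i = 0; i < b; ++i)
        for (int j = 0; j < c; ++j)
            for (int k = 0; k <= i; ++k) // T^T is lower triangular
                TW[i][j] += T[k][i] * W[k][j];

    // Step 3: A2 <- A2 - V * W
    for (int r = 0; r < m; ++r)
        for (int j = 0; j < c; ++j)
            for (int i = 0; i < b; ++i)
                A2[r][j] -= V[r][i] * TW[i][j];
}
```

A small case that can be verified by hand, such as two reflectors applied to a single column, is a practical way to isolate an error in one of the three steps before moving on to full-size blocks.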

8.5 Evaluation
The algorithms for LU- and QR-decomposition have a similar structure, so ideas from the LU implementation were also applied to the QR implementations. Optimising the running time of the different tasks increased performance by a factor of 3.84. Unfortunately, due to the incomplete implementation of a block QR algorithm, no further tests were performed. This means that the full potential of QR on the Cuda architecture is still to be uncovered. Jack Dongarra, Susan Ostrouchov and others have designed this block QR algorithm. They are highly competent people who have made contributions to Eispack, Linpack, BLAS, Lapack and ScaLapack. The challenge with the triangular factor $T$ is more than likely related to my implementation rather than the algorithm.


9 Evaluation
The optimisation strategy described some methods and techniques that could be applied when improving the implementation of the linear algebra algorithms. This evaluation paragraph will summarise the findings and evaluate the strategy.

9.1 Cuda
The Nvidia profiler can show relevant counters for both arithmetic and memory performance. CGMA source code analysis can give valuable information about memory bandwidth as a limiting factor. The results from the tests suggest that a block linear algorithm is best suited for the Cuda architecture. Such an algorithm is designed to divide data into sizes that fit into caches, such as shared memory. When implementations are to be optimised, the findings from this project suggest that tiling is the best strategy, followed by latency hiding and coalescing memory access. With regards to coalescing memory access, it should be mentioned that GPU architecture designers are aware of the importance of this limiting factor, so newer GPUs are designed with built-in optimised memory access. The impact of non-coalesced memory access should therefore be of less importance in the future, and hence make porting of existing algorithms easier.

In addition to the points above, here is a list with recommendations based on the findings of the tests performed in this project:

• Avoid using structures as parameters in the kernel definitions; use simple types or pointers thereof instead.
• Target the highest possible Compute Capability level. Among other things, the precision of instructions is better and the result will be more accurate.
• Unroll loops, by making the threads fine-grained. Generation and thread scheduling are cheap.
• Thread block size should be a multiple of the warp size (currently 32).
• Be aware of the overhead for invoking a kernel.
• Note that default instructions deviate from IEEE 754. Use specific IEEE 754 functions for increased precision, but at the cost of speed.

Besides the list and suggestions above, there were also methods with doubtful results:

• The Volkov suggestion yielded performance gains on some systems, but losses on others. It can be useful for low occupancy kernels, but should be tested and evaluated.
• Data prefetching can both increase and lower performance.

The underlying hardware and its capabilities play an important role in whether an optimisation technique affects performance. Some methods have a positive effect on some GPUs and a negative effect on others. Analysis and testing should therefore always be performed.
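The recommendation that the thread block size should be a multiple of the warp size can be enforced with a small rounding helper on the host. This is a sketch with hypothetical names, assuming the warp size of 32 quoted in the list above.

```cpp
#include <cassert>

const int kWarpSize = 32; // warp size quoted in the recommendations

// Rounds a requested thread block size up to the next multiple of the warp
// size, so that no warp is only partly filled.
int round_up_to_warp(int requested_threads)
{
    assert(requested_threads > 0);
    return ((requested_threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}
```

A request for 100 threads is rounded up to 128, so the kernel needs a bounds check for the 28 surplus threads.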

9.2 GPU.NET
GPU.NET v1.0.3.5 was not mature and suffered from several bugs. The number of problems makes it not recommendable for production environments. However, the latest release is v2.0.14, which solves many of the bugs and problems I encountered. The JIT compilation of kernels is a design decision that applies to all current versions of GPU.NET. A JIT compilation is cached in-memory, and subsequent calls from the same process will be served from this cache. It is therefore recommended, when using GPU.NET for large or numerous problems, to warm-up both Cuda and GPU.NET. Do this by calling the kernel with a small data size, subsequent calls will then be served faster.


10 Discussion and future work
This chapter begins with a discussion of the work and results in this project, and which fields could be further researched. Then, the future of Cuda is discussed in comparison with GPU code generation tools. After which a broader perspective on hardware development and GPGPU in general, is discussed.

10.1 Project
A more thorough correctness test and analysis could further clarify the numerical stability of the implementations used in this project, for example by comparing this project's results with results from the widely recognised Matlab. For this project I insisted on implementing all parts of the algorithms. A lot of work and research has gone into the development of standard math libraries supporting the BLAS interface. Implementing all parts gave valuable insight into the inner workings of the algorithms, but possibly at the expense of performance. Using these libraries (e.g. Cublas or Cula) could reveal the full performance potential of the different algorithms on the Cuda architecture. Testing the performance of other linear algebra algorithms could serve as a frame of reference. For example, how would using Givens rotations instead of the chosen Householder transformation method affect the performance of QR-decomposition? A more thorough analysis and testing of the QR block algorithm would also be beneficial. The optimisation strategy and the optimisation experiences could be applied to several other linear algebra algorithms. An obvious extension would be the Singular Value Decomposition (SVD).

10.2 Cuda
Cuda C and GPU.NET currently represent two different directions for utilising the Cuda architecture. Cuda C is C/C++ and complex, whereas GPU.NET is .NET, uses code generation, and is easier to use. You might say that GPU.NET is for developers who want to accelerate their applications using parallel architectures without too much trouble. Cuda C, on the other hand, is for developers who are not intimidated by C/C++ and tweaking. Cuda C offers more flexibility, which enables better optimisation and higher performance. It does not, however, have to be a choice between the two sets of advantages. Cudafy.NET is a set of libraries and tools supporting both directions.


Cudafy.NET can be used in the same way as GPU.NET, using full code generation. But it can also just work as a bridge from .NET to Cuda C kernels. Cuda C optimisations are then possible, while the invocation is carried out by the .NET runtime. Uncovering the performance characteristics of Cudafy.NET, e.g. using the linear algebra algorithms from this project, could be another valuable next step. It is expected that Cuda will continuously be improved, e.g. by making NVCC support C++ language features in kernels, allowing better debugging in Nsight, and increasing the language support features in IDEs, to make development smoother. With Cuda v4.0 the tools and drivers have been updated, and now enable a grid of machines and GPUs to work together to solve large problems. This makes Cuda able to solve even larger problems than with former versions.

10.3 Hardware
The newer Cuda GPUs are becoming increasingly accurate, meaning the instructions are performed with better numerical precision, at even faster speeds. Double-precision instructions have been supported from Compute Capability 1.3, and it is expected that these will also become more precise and faster. The future will surely also bring GPUs with even more cores and faster memory. Currently the architecture of the Nvidia GF100 chips supports up to 512 cores, but the dedicated GPU computing system Tesla S2050 has 4 GPUs with a total of 1,792 cores. Nvidia is not the only player when it comes to GPGPU. AMD has the FireStream architecture, and the top model FireStream 9370 has 1,600 cores delivering 2,640 gigaflops. Looking at the latest “TOP500 supercomputers list”, 3 of the top 5 are using Nvidia GPUs. So Nvidia is a strong player, and I expect Nvidia and Cuda to play an important role in the GPGPU field in the future.

10.4 Future of GPGPU
GPGPU development has for a long time been limited to first movers that saw a potential in the high processing power that GPUs offer. Currently GPGPU is often used, where many computations are needed. For example simulations of fluid or weather forecasting, or the prediction of protein folding used by the pharmaceutical industry. But another and more subtle application is slowly emerging.


A graphics card with a high performing GPU is a relatively cheap commodity, and many regular computer systems are today equipped with a high performing GPU. Some application developers have spotted this opportunity and now allow their application to be optionally accelerated by the GPU. This is often completely transparent to the end user, but delivers a shorter application response time, which gives the user a better experience. Applications that currently exploit this possibility include, but are not limited to, different browsers, such as Internet Explorer, Chrome and Firefox, and different video editing applications.


11 Conclusion
Three frequently used linear algebra algorithms, for matrix-multiplication, LU-decomposition and QR-decomposition, were chosen for this project. They were described, analysed, and then initially implemented using C/C++ for the CPU architecture. The Cuda architecture and development platform was subsequently analysed and described. Important features, characteristics and limitations were uncovered and an optimisation strategy was formed. Based on the analysis of the linear algebra algorithms and Cuda, implementation procedures were designed. Then the algorithms were implemented targeting the Cuda architecture, using C/C++ and Cuda C, after which they were tested. During this process different findings were made, which were subsequently used in combination with the Cuda optimisation strategy to improve performance. GPU.NET was used, where applicable, as a perspective on how to use Cuda from .NET. Correctness tests were performed by comparing the results from the CPU with the results from the GPU. The maximum differences documented the accuracy of the different algorithms processed on various systems and GPUs. The learning goals have all been achieved and the complete process has been documented in this report.


Bibliography and references
1. Mikkel Bundgaard-Ovesen, Documentation of the GPUs usability in advanced parallel calculations, 15th December 2010
2. Nvidia Cuda, Nvidia Cuda C Programming Guide, 9th November 2010
3. Desmond Fearnley-Sander, Hermann Grassmann and the creation of linear algebra, December 1979
4. David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, 2010
5. Jason Sanders and Edward Kandrot, Cuda by example – an introduction to General-Purpose GPU Programming, 2011
6. Vasily Volkov, Better Performance at Lower Occupancy (slides), 22nd September 2010
7. Vasily Volkov, Use registers and multiple outputs per thread on GPU (slides), 30th June 2010
8. Geekbench, Performance of an Intel Pentium 4 3.06 GHz running Linux, Downloaded 3rd June 2011 (http://browse.geekbench.ca/geekbench2/view/209683)
9. Nvidia, Cuda C Best practices Guide, 20th September 2010
10. Paulius Micikevicius (Nvidia), Analysis-Driven Optimization (slides), 14th November 2010
11. Sara Robinson, Toward an Optimal Algorithm for Matrix Multiplication, November 2005
12. Ananth Grama et al., Introduction to Parallel Computing, 2nd edition, 26th January 2003
13. Mary Jane Sterling, Linear Algebra for Dummies, 2009
14. Brian W. Kernighan and Dennis M. Ritchie, C Programming Language, 2nd edition, 1st April 1988
15. John J. Barton and Lee R. Nackman, Scientific and Engineering C++: An Introduction With Advanced Techniques and Examples, 19th August 1994
16. Jens Eising, Lineær Algebra, 1999
17. G. W. Stewart, Afternotes on Numerical Analysis, 1996
18. E. E. Santos and M. Muraleetharan, Analysis and Implementation of Parallel LU-Decomposition with Different Data Layouts, June 2000
19. Prof. Michael T. Heath, Parallel Numerical Algorithms: LU-Decomposition (slides), 2010
20. Vasily Volkov and James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra, November 2008
21. Vasily Volkov and James W. Demmel, LU, QR and Cholesky Factorisations using Vector Capabilities of GPUs, 2008
22. Jack Dongarra et al., Derivation of a Block Algorithm for LU Factorization, 9th February 1997
23. Peter J. Olver, Orthogonal Bases and the QR Algorithm, 5th June 2010
24. Prof. Michael T. Heath, Parallel Numerical Algorithms: QR-Factorization (slides), 2010
25. Walter Gander, Algorithms for the QR-Decomposition, April 1980
26. Radu Trîmbitas, Householder Reflectors and Givens Rotations: Why orthogonality is fine, 11th March 2009
27. Susan Ostrouchov, QR Factorization (a block algorithm), 28th April 1995


Appendix A – Project evaluation
The initial problem definition about linear algebra algorithms was updated during the project period. In agreement with my supervisor Peter Sestoft, we decided to focus on linear algebra algorithms for matrix-multiplication, and LU- and QR-decomposition. This clarification enabled me to focus on analysing Cuda features and limitations. My assessment is that it made it possible to uncover Cuda characteristics in a level of detail that would otherwise not have been possible. I am satisfied with the result of the project. The problem definition and learning goals were all fully met, and the process and findings are all described in the report. But not everything went without challenges, so let me elaborate. To be able to implement an algorithm, a full comprehension of the algorithm and its inner workings is necessary, and this proved to be severely complicated with regards to LU- and QR-decomposition. In addition to the linear algebra complications, there was the difficulty of using a new development architecture and programming language. I can best describe this by comparing it to building a house of cards. The implementation phase is represented by the last third of the house. So before being able to build the top of the house, one needs to build the first 2/3, and before that, one needs to determine where the house should be based. My initial lack of knowledge of linear algebra, C and C++ meant that many resources were invested in learning and gaining abilities. In spite of the initial research phase, I did encounter situations during the project period where my knowledge still did not suffice. As mentioned above, this applied specifically to LU- and QR-decomposition. With QR it was specifically the block algorithm that was difficult to comprehend. During the 6 months that the project period lasted, I did learn a great deal. Learning goals covering linear algebra and algorithms, together with Cuda, C and C++, defined the areas in which I wanted to extend my knowledge.
As mentioned, I had only minor experience and no qualifications in the fields prior to this project. So the learning requirement was high and the learning curve was steep, but I am satisfied with the result, and the knowledge I have gained will be beneficial in the future.

Appendix B – Implementation considerations
When the Cuda platform is utilised for processing, a new computing environment is introduced into development and runtime. The host refers to the code and memory of the CPU, and the device refers to the code and memory of the GPU. Code and functionality that exhibit little or no data parallelism are implemented in host code. Code and functionality that exhibit a rich amount of data parallelism are implemented in device code [4]. The host and device are two runtime environments that work independently. Communication between the host and device is obviously necessary, as otherwise the CPU would not be able to harness the GPU power of the Cuda architecture. In Cuda, the host is responsible for this communication, which includes structuring data, allocating and releasing memory on the device, copying data to and from the device, and invoking the device kernel. In addition, the host is responsible for configuring the device execution environment, basically specifying the number of threads the architecture should spawn to solve a problem. The Cuda architecture allows, as shown in the following figure, threads to be organised in blocks, and blocks to be organised in a grid.

Figure 20 - Cuda thread organisation [4]

Cuda thread organisation
A kernel is mapped to a grid, which is organised by blocks in two dimensions, and a block can hold threads in three dimensions. In the device kernel, a block and a thread are identified by the following built-in variables:

Variable       Description
gridDim.x      Holds the number of blocks in the first dimension of the grid. Values are valid in the range 1-65535.
gridDim.y      Holds the number of blocks in the second dimension of the grid. Values are valid in the range 1-65535.
blockDim.x     Holds the number of threads in the first dimension of the block. Values are valid in the range 1-512.
blockDim.y     Holds the number of threads in the second dimension of the block. Values are valid in the range 1-512.
blockDim.z     Holds the number of threads in the third dimension of the block. Values are valid in the range 1-64.
blockIdx.x     Holds the current block's position in the first dimension of the grid. Values are valid in the range 0 to gridDim.x - 1.
blockIdx.y     Holds the current block's position in the second dimension of the grid. Values are valid in the range 0 to gridDim.y - 1.
threadIdx.x    Holds the current thread's position in the first dimension of the current block. Values are valid in the range 0 to blockDim.x - 1.
threadIdx.y    Holds the current thread's position in the second dimension of the current block. Values are valid in the range 0 to blockDim.y - 1.
threadIdx.z    Holds the current thread's position in the third dimension of the current block. Values are valid in the range 0 to blockDim.z - 1.

Table 10 - Cuda built-in variables

Why has Nvidia designed a thread structure in up to five dimensions? Would it not be easier to just use a single dimension?

For simple algorithms that only require a thread structure in one dimension, this can be achieved. But there exist problems that naturally belong to a space of two or more dimensions, e.g. a matrix. The structure is optional, meaning that the developer, within some hardware limitations, decides how many dimensions to use. The total number of threads is given by:

$$\text{threads} = \text{gridDim.x} \times \text{gridDim.y} \times \text{blockDim.x} \times \text{blockDim.y} \times \text{blockDim.z}$$

where blockDim.x * blockDim.y * blockDim.z cannot exceed the GPU's limit on the total number of threads per block. For most GPUs this limit is 512 [5].
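The thread count formula and the per-block limit can be checked on the host before a kernel is launched. The following is a minimal sketch in plain C, not code from the project; the function names and the constant MAX_THREADS_PER_BLOCK are hypothetical, and the limit of 512 is simply the figure quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed limit, taken from the text: 512 threads per block on most GPUs. */
#define MAX_THREADS_PER_BLOCK 512u

/* Threads in one block: blockDim.x * blockDim.y * blockDim.z. */
unsigned long long threads_per_block(unsigned bx, unsigned by, unsigned bz)
{
    assert(bx > 0 && by > 0 && bz > 0); /* every dimension is at least 1 */
    return (unsigned long long)bx * by * bz;
}

/* Total threads of a launch: blocks in the grid times threads per block. */
unsigned long long total_threads(unsigned gx, unsigned gy,
                                 unsigned bx, unsigned by, unsigned bz)
{
    return (unsigned long long)gx * gy * threads_per_block(bx, by, bz);
}

/* True if the block shape respects the assumed per-block thread limit. */
bool block_fits(unsigned bx, unsigned by, unsigned bz)
{
    return threads_per_block(bx, by, bz) <= MAX_THREADS_PER_BLOCK;
}
```

A 4 x 4 grid of 16 x 16 x 1 blocks gives 4096 threads, whereas a 32 x 32 x 1 block (1024 threads) would be rejected under this limit.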

The size of the grid and blocks is often defined directly in the source code, but the optimal size is in many cases directly dependent on the data size. This is not very flexible, as it means that the grid and block size would have to be adjusted in the source code for different data sizes, and the program recompiled before execution. There are different solutions to this. One way is to set the number high enough to cover most cases. In the kernel one would then have to check whether the current thread actually has data to process, as done in line 6:

     1. __global__ void kernel(float *data, int dataSize) {
     2.
     3.     // Thread ID
     4.     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     5.
     6.     if (tid < dataSize) {
     7.
     8.         // Process data
     9.     }
    10. }

This is inefficient, as many threads will be spawned without any actual data to process. Another way is to define the number of threads per block (e.g. 128) and then calculate the number of required blocks from the data size. This ensures that at most (threadsPerBlock - 1) threads are created without any data to process. A third way is to calculate the grid and block size dynamically from the data size; this is however difficult, as the optimal setting is influenced by both the data size and the structure of the algorithm.

Either the second or the third method can prove feasible. Both have pros and cons, and which method to use should be determined on a case-by-case basis.
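The second method, a fixed number of threads per block with the block count derived from the data size, amounts to a ceiling division. The sketch below is plain host-side C under that assumption; the function names are hypothetical and not taken from the project code.

```c
#include <assert.h>

/* Smallest number of blocks such that blocks * threads_per_block >= data_size
   (integer ceiling division). */
unsigned blocks_needed(unsigned data_size, unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return (data_size + threads_per_block - 1) / threads_per_block;
}

/* Threads that are launched but fail the tid < dataSize check in the kernel. */
unsigned idle_threads(unsigned data_size, unsigned threads_per_block)
{
    return blocks_needed(data_size, threads_per_block) * threads_per_block
           - data_size;
}
```

With 1000 elements and 128 threads per block this gives 8 blocks and 24 idle threads. The worst case is a single element past a block boundary, which leaves threadsPerBlock - 1 threads idle. In a Cuda program the result would typically be used as the grid size in the launch configuration, e.g. kernel<<<blocks, threadsPerBlock>>>(data, dataSize).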

SIMT and warp size
As mentioned earlier, threads are organised in blocks. But this is not the only organisation; each block is partitioned into warps. A warp is a bundle of 32 threads that are executed in parallel. The threads of a warp execute the same instruction at the same time; hence Cuda is a Single Instruction, Multiple Threads (SIMT) architecture. This is a design decision to reduce hardware cost and to enable optimisation techniques, and it is not without relevance to the developer. The SIMT architecture has some implications that will be discussed later. The size of the warp is another important aspect to take into account when defining the grid and block size. Consider the example where a problem is organised into 20 blocks, each with 10 threads, giving a total of 20 x 10 = 200 threads. Cuda executes the 32 threads of a warp in parallel. In the example above, only 10 threads are available per block. Cuda will in this case fill up each warp with 22 empty threads, resulting in 20 x 22 = 440 empty threads being created. The block size should therefore, in theory, be a multiple of 32 [4].
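The padding in the example can be reproduced with a few lines of host-side C. This is only an illustrative sketch: the function names are hypothetical, and the warp size of 32 is the value stated in the text.

```c
#include <assert.h>

/* Warp size as stated in the text. */
#define WARP_SIZE 32u

/* Number of warps needed to hold one block (rounded up). */
unsigned warps_per_block(unsigned threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Empty thread slots added to one block to fill its last warp. */
unsigned empty_threads_per_block(unsigned threads_per_block)
{
    return warps_per_block(threads_per_block) * WARP_SIZE - threads_per_block;
}

/* Empty thread slots over the whole grid. */
unsigned empty_threads_total(unsigned blocks, unsigned threads_per_block)
{
    return blocks * empty_threads_per_block(threads_per_block);
}
```

For 20 blocks of 10 threads this yields 22 empty threads per block and 440 in total, matching the example, while a block size of 64 yields none.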

Elapsed time
Measuring elapsed time is essential to measuring performance. Normal event timing in C and C++ is CPU based, which is insufficient when dealing with the GPU. The GPU and CPU are physically two independent processors that run in parallel. The Cuda toolkit provides an API for measuring GPU events and elapsed time. The Cuda API will be used to measure memory allocation, the copy of data from host to device, the kernel execution time, the copy of data from device to host, and the release of memory. These timers will not just give the elapsed times of different operations, but also valuable insight into the GPU performance. It will for instance be possible to calculate memory transfer rates as well as the actual performance of the kernel in gigaflops. In addition, the timings can be used to measure relative performance gains or losses when certain properties or capabilities of the Cuda architecture have been applied to the algorithms. Finally, the GPU timings will serve as a base for comparison with the similar linear algebra processes on GPU.NET and the CPU.
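Once the elapsed times are available, the derived figures are simple arithmetic. The sketch below shows one way to compute them in host-side C. It rests on two assumptions that are not stated in the text: 1 GB is taken as 10^9 bytes, and an n x n matrix-multiplication is counted as 2 * n^3 floating point operations. The function names are hypothetical, and the timer calls themselves are not shown.

```c
#include <assert.h>

/* Transfer rate in GB/s from bytes copied and elapsed milliseconds.
   Assumption: 1 GB = 1e9 bytes, so bytes per ms divided by 1e6 gives GB/s. */
double transfer_rate_gb_per_s(double bytes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return bytes / elapsed_ms / 1e6;
}

/* Gigaflops of an n x n matrix-multiplication kernel.
   Assumption: the kernel performs 2 * n^3 floating point operations. */
double matmul_gigaflops(unsigned n, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    double flops = 2.0 * n * n * n;
    return flops / elapsed_ms / 1e6;
}
```

As an example, copying 5 * 10^9 bytes in 1000 ms corresponds to 5 GB/s, and multiplying two 1000 x 1000 matrices in 2000 ms corresponds to 1 gigaflops under these assumptions.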

Pinned or page-locked memory
A program that uses Cuda to harness the power of the GPU normally follows these steps:
1. Initialise
2. Copy data from host -> device
3. Process data on device
4. Copy data from device -> host
5. Release
The kernel has been the focus of optimisation and analysis so far, but there are other ways of optimising a program that uses Cuda. By using pinned or page-locked memory, higher data transfer rates can be achieved between host and device. On platform #1 the speed of memory transfer could be increased from about 1.5 GB/sec to 5 GB/sec. But caution should be exercised when pinned memory is used; excessive use can reduce overall system performance, as page-locked memory is scarce [9].

Matrix structure
Matrices are mathematical structures in two dimensions. In computer memory a matrix can be represented either by a 2-dimensional array or by an array of arrays. Even though 2-dimensional structures are available in computer memory, it is better to vectorise the matrix by laying out the rows one after another. Accessing a specific value in the vector of matrix A is performed like so: v[3 * Width + 2], where v is the vector of matrix A and Width is the column count of A. With zero-based indices this addresses the element in row 3, column 2. The Cuda architecture is designed to be stream based, so by vectorising data for processing on the GPU platform, one uses Cuda as it was designed and intended. For the code I use the following matrix structure to hold the vector and details about the matrix.
    typedef struct
    {
        float *n;
        unsigned int width;
        unsigned int height;
        unsigned int size;
    } matrix;

n is the pointer to the vector of float values, width is the number of columns, height is the number of rows, and size is the length of the vector (height * width).
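To illustrate how the structure and the row-major indexing fit together, here is a minimal host-side sketch. The struct repeats the definition above so that the example is self-contained; the helper functions matrix_create, matrix_index and matrix_free are hypothetical and are not claimed to be the helpers used in the project.

```c
#include <assert.h>
#include <stdlib.h>

/* Same layout as the matrix structure described in the text. */
typedef struct
{
    float *n;            /* vectorised values, rows laid out one after another */
    unsigned int width;  /* number of columns */
    unsigned int height; /* number of rows */
    unsigned int size;   /* height * width */
} matrix;

/* Allocate a zero-initialised matrix on the host; n is NULL on failure. */
matrix matrix_create(unsigned int height, unsigned int width)
{
    matrix m;
    m.width = width;
    m.height = height;
    m.size = height * width;
    m.n = calloc(m.size, sizeof(float));
    return m;
}

/* Position of element (row, col) in the vector, using zero-based indices. */
unsigned int matrix_index(const matrix *m, unsigned int row, unsigned int col)
{
    assert(row < m->height && col < m->width);
    return row * m->width + col;
}

/* Release the vector and reset the bookkeeping fields. */
void matrix_free(matrix *m)
{
    free(m->n);
    m->n = NULL;
    m->size = 0;
}
```

For a matrix with 4 rows and 5 columns, element (3, 2) is stored at position 3 * 5 + 2 = 17 of the vector.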

Appendix C – Hardware specification description and analysis
In the following, the specifications of the different machines are described and evaluated in terms of Cuda capabilities. The speeds of the GPU and memory are measured, and from these the memory bandwidth and gigaflops are calculated. The GPU has historically been designed for performance and not precision; hence all gigaflops calculations are based on single-precision float operations. It was not until Cuda compute capability (CC) 1.3 that double precision was supported, and then with a significant performance hit. The following specifications are based on information and measurements by CPU-Z, GPU-Z and Cpu Caps Viewer, as well as information about bus, FSB, PCI-E and more. The details are meant to give a theoretical upper limit on performance, which can be used for comparison with the results of the tests.

Platform #1
Apple MacBook 13” with an Intel Core 2 Duo P8700 2.53 GHz processor, 4 GB DDR3 RAM at 533 MHz, an Nvidia GeForce 9400m and a front-side bus (FSB) at 1066 MHz. The GPU in the machine has the following specifications:

Cores                                    16
Memory interface                         128-bit
Memory bandwidth (internal/external)     8 GB/sec, 16.6 GB/sec
Graphics bus interface (PCI-E v2.0)      8 GB/sec
Transistors                              282 million
Core clock                               450 MHz
Shader clock                             1100 MHz
Memory clock                             1066 MHz (533 MHz double pumped)
Gigaflops                                51.56
Cuda Compute Capability                  1.1

Table 11 - GPU specifications for Nvidia GeForce 9400m, platform #1

Platform #2
Apple iMac 24” with an Intel Core 2 Duo E8435 3.06 GHz processor, 4 GB DDR2 RAM at 399 MHz, an Nvidia GeForce 8800 GS and an FSB at 1066 MHz. The GPU in the machine has the following specifications:

Cores                                    96
Memory interface                         256-bit
Memory bandwidth (internal)              49.94 GB/sec
Memory bandwidth (external)              6.23 GB/sec
Graphics bus interface (PCI-E v1.1)      8 GB/sec
Transistors                              754 million
Core clock                               500 MHz
Shader clock                             1250 MHz
Memory clock                             800 MHz
Gigaflops                                234.38
Cuda Compute Capability                  1.1

Table 12 - GPU specifications for Nvidia GeForce 8800 GS, platform #2

Platform #3
Platform #3 is a machine with an Nvidia Tesla C1060 GPU. The exact machine specifications have not been available; however, the specifications of the C1060 GPU give some hints about the performance.

Cores                                    240
Memory interface                         512-bit
Memory bandwidth (internal)              102.4 GB/sec
Transistors                              754 million
Core clock                               602 MHz
Shader clock                             1300 MHz
Memory clock                             1600 MHz
Gigaflops                                933.12 for Total(Mul+Add+Special Function)
                                         622.08 for Total(Mul+Add)
Cuda Compute Capability                  1.3

Table 13 - GPU specification for Nvidia Tesla C1060, platform #3

Platform evaluation
You may wonder why platform #3 has two different gigaflops figures. The first is based on the specifications of the G80 and its descendant architectures, which state that the GPU is capable of performing a Multiply-Add instruction dual-issued with a special function instruction per operation cycle. The second is based on the newer Fermi architecture specifications, in which an operation cycle can perform a dual-issued Multiply-Add instruction. That a newer architecture should be slower than an older one contradicts the logic of development and improvement, and indeed it is not so. The G80-based architectures are equipped with streaming processors (SP) and separate special function units (SFU). The SP combined with the SFU theoretically gives 3 operations per clock cycle; however, a gigaflops figure based on these specifications is very theoretical. Calculating the gigaflops performance this way may be formally correct, but it does not yield an achievable result. The reason is most likely Nvidia's competition with other GPU manufacturers to produce the GPU with the highest gigaflops count. Most development and testing are performed on platform #1, so this platform will serve as a base.

Specifications
This section digs a little deeper into the specifications of the hardware and describes the theoretical performance limits. When dealing with GPUs, the most important figures are memory transfer rates and GPU gigaflops. The relevant elements are the chipset, the front-side bus (FSB), memory speeds and the GPU.

Figure 21 – Block diagram of a chipset. Source: Intel

Chipset
The chipset consists of a north bridge and a south bridge. The north bridge is responsible for handling the exchange of data between the CPU, memory and the graphics adapter. The south bridge handles the exchange of data with external devices like audio, network, hard discs and USB devices. The north bridge is the most data intensive and the one relevant for this project, whereas the south bridge is not used by GPU accelerated applications. The bus speed of platform #1 is 266 MHz with a multiplier of 4, making the rated FSB about 1066 MHz. The width of the bus is 64 bits, making the transfer rate:

$$\text{GB/s} = \frac{\text{FSB (MHz)} \times \text{bus width (bits)}}{8 \times 1024}$$

The transfer rate is then 8.33 GB/s.

Figure 22 - CPU and bus details of platform #1

Memory
Memory transfer occurs when data is copied from host to device, and again when the result is copied from device to host. This data is transferred via the chipset's north bridge from the CPU/system memory to the device memory. The GPU of platform #1 has no dedicated memory and uses the system memory. The transfer rate is for that reason equal to that of the system memory. The system memory consists of two DDR3 modules whose peak transfer rate is double that of the FSB (double data rate), meaning 16.66 GB/s.
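The bus and memory figures for platform #1 can be reproduced with a short host-side calculation. The sketch assumes that the rate in GB/s is the rated clock in MHz times the bus width in bits, divided by 8 * 1024, which gives the 8.33 GB/s quoted for the FSB; the function names are hypothetical.

```c
#include <assert.h>

/* Peak transfer rate in GB/s from a rated clock in MHz and a bus width in bits.
   Assumed formula: MHz * bits / (8 * 1024). */
double bus_rate_gb_per_s(double clock_mhz, unsigned width_bits)
{
    assert(width_bits > 0);
    return clock_mhz * width_bits / (8.0 * 1024.0);
}

/* Peak rate of the double data rate system memory: twice the FSB rate. */
double ddr_rate_gb_per_s(double fsb_rate_gb_per_s)
{
    return 2.0 * fsb_rate_gb_per_s;
}
```

With a 1066 MHz FSB and a 64-bit bus this gives about 8.33 GB/s, and doubling it gives about 16.66 GB/s for the system memory.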

Figure 23 - System memory details of platform #1

Graphics adapter
The software GPU-Z reports that the GPU of platform #1 runs on a PCI port. But this cannot be true, as the transfer rate would then be about 2 GB/sec. My guess is that, as the Nvidia specification says, it runs on a PCI Express 2.0 bus interface with a peak transfer rate of 8 GB/s one way, which incidentally is the same as the memory transfer rate.

Figure 24 - GPU details of platform #1

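The gigaflops count is calculated by:

$$\text{gigaflops} = \frac{\text{cores} \times \text{shader clock (MHz)} \times \text{operations per cycle}}{1024}$$

Platform #1 has 16 cores with a shader clock of 1100 MHz. The number of operations per cycle is theoretically 3 (Mul+Add+SF), which in turn results in a gigaflops count of 51.56.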
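The theoretical gigaflops count can be expressed as a one-line function. The sketch assumes the calculation is cores times shader clock in MHz times operations per cycle, divided by 1024, which reproduces the 51.56 stated for platform #1; the function name is hypothetical.

```c
#include <assert.h>

/* Theoretical peak gigaflops.
   Assumed formula: cores * shader clock (MHz) * operations per cycle / 1024. */
double peak_gigaflops(unsigned cores, double shader_mhz, unsigned ops_per_cycle)
{
    assert(cores > 0 && ops_per_cycle > 0);
    return cores * shader_mhz * ops_per_cycle / 1024.0;
}
```

For platform #1, 16 cores at 1100 MHz with 3 operations per cycle give 51.5625. With 96 cores at 1250 MHz and 2 operations per cycle the same function gives 234.375, which agrees with the 234.38 listed for platform #2 and suggests that this figure counts only Mul+Add.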

Evaluation
Development and testing will mainly be performed on platforms #1 and #2, even though they lack the extreme computing power that platform #3 possesses. The purpose of this project is to test the applicability of GPGPU for solving different problems, and the focus is furthermore on testing relative performance gains or losses of different optimisation techniques. Platform #3 will however give an important insight into the performance of solving these problems on a massively parallel architecture. Platform #3 furthermore supports a higher compute capability, which makes even more optimisation techniques available, as well as double precision operations.

Appendix D – Development environment problems and solution model
Making VS2008 ready for Cuda development was a challenge. The following is a description of the problems experienced and the solution used.

Development model
Cuda toolkit version 3.2 supports Microsoft Visual Studio 2005 (VS2005) and Visual Studio 2008 (VS2008). It is possible to enable development in Visual Studio 2010 (VS2010), but this has proven difficult to set up. This is, among other reasons, because the Nvidia Cuda compiler (NVCC) requires either a Visual C++ version 8 or 9 compiler. I have tried to set Cuda up for VS2010, but the trouble has led me to the conclusion that the problems and minor inconveniences far exceed any gains achieved by using VS2010.

The Nvidia GPU Computing SDK, a separate package, provides help, tutorials, utility helpers as well as code examples. With this package, all the hard work of configuring Visual Studio and setting up paths and environment variables is done for you. However, with it follow libraries packed with utility and helper functions and references to other libraries. Performance is important in this project, and it is not possible to say what impact any referenced libraries or utility functions might have. I have therefore decided to create a new and clean project model that can serve as a base for the performance tests in Cuda. By doing so, I gain valuable insight into the structure of the toolkit and its applicability.

Cuda C and C++
Device code has to be written in Cuda C, a language based on ANSI C. Host code, on the other hand, does not. The language C has evolved since its initial release, and C++ provides new features and updated libraries. It can therefore be desirable to use a mix of C and C++ when coding for the host, while device code must be written in C. In the project model, C++ code must be contained in .cpp files and C for Cuda in .cu files. With the correct configuration it is possible to make NVCC compile the .cu files and Visual C++ 9.0 compile the rest. The linker's responsibility is then to link the compiled objects and functions into a single executable file.

A description of what steps I had to take, and the project model can be found on my blog: http://blog.ovesens.net/2011/05/cuda-v3-2-template-project-using-cpp/

Appendix E – CGMA and Cuda profiler
Being able to optimise requires event measuring or profiling. But event measuring or profiling requires knowledge of what to profile. The Nvidia paper on “Analysis Driven Optimization” [10] identifies four categories of what can limit a kernel's performance: memory throughput, instruction throughput, latency, or a combination of the above. To achieve the best performance it is important to strike the right balance in the instructions:bytes ratio, also called compute to global memory access (CGMA) [4].

Two methods should be applied when trying to determine optimisation possibilities: source code analysis and tool profiling. By looking at the source code, the developer can analyse whether a kernel is memory or instruction bound, and whether the ratio between the two is limiting the performance. To measure events, Nvidia provides a tool called “Compute Visual Profiler” that provides different counters. Different profile counters are available for GPUs of different CC levels; a complete list and description can be found in the “Compute Visual Profiler User Guide”. The higher the compute capability, the more detailed and accurate the counters, but that does not mean that this project's development hardware, with compute capability 1.1, does not provide any counters with valuable insight. These are shown and described in the following table.

Counter            Description
divergent branch   Number of divergent branches within a warp. This counter is incremented by one if at least one thread in a warp diverges (that is, follows a different execution path) via a data dependent conditional branch. The counter is incremented by one at each point of divergence in a warp.
instructions       Number of instructions executed.
gld uncoalesced    Number of non-coalesced global memory loads.
gld coalesced      Number of coalesced global memory loads.
gst coalesced      Number of coalesced global memory stores.
local load         Number of local memory load transactions. Each local load request will generate one transaction irrespective of the size of the transaction.
local store        Number of local memory store transactions; incremented by 2 for each 32-byte transaction, by 4 for each 64-byte transaction and by 8 for each 128-byte transaction for compute devices having compute capability 1.x. It is incremented by 1 irrespective of the size of the transaction for compute devices having compute capability 2.0.

Table 14 - Selected profile counter from Compute Visual Profiler User Guide

These profile counters can give valuable insight into what a kernel actually does, but they cannot be used without consideration. Nvidia writes the following: Compute Visual Profiler values are best used to identify relative performance differences between un-optimized and optimized code. But holding the profiled numbers together with analysed numbers presents a good estimate of how much bandwidth is wasted by suboptimal coalescing of memory access [9].
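The source code analysis mentioned at the start of this appendix can be illustrated with the CGMA ratio, taken here as floating point operations per global memory access. The sketch below is a simplified model and not a measurement from the project: it assumes a straightforward matrix-multiplication inner loop with two global loads and two floating point operations (one multiply, one add) per iteration, and a tiled kernel in which every value loaded from global memory is reused tile-width times. The function names are hypothetical.

```c
#include <assert.h>

/* CGMA: floating point operations per global memory access. */
double cgma_ratio(double flops, double global_accesses)
{
    assert(global_accesses > 0.0);
    return flops / global_accesses;
}

/* Simplified model of a tiled matrix-multiplication: per tile step each thread
   issues 2 global loads and performs 2 * tile_width floating point operations. */
double tiled_matmul_cgma(unsigned tile_width)
{
    return cgma_ratio(2.0 * tile_width, 2.0);
}
```

Under these assumptions the untiled inner loop has a CGMA of 2 / 2 = 1.0, and a tile width of 16 raises it to 16.0, which indicates why tiling shifts a kernel away from being memory bound.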

Appendix F – Matrix-multiplication CC levels
Kernel                        CC 1.1 (gigaflops)   CC 1.3 (gigaflops)   CC 2.0 (gigaflops)
Resulting matrix              7.80                 7.80                 14.76
Tiled v1                      19.12                19.11                19.86
Tiled v2 Latency hiding       31.97                31.95                33.57
Tiled v3 Prefetching          32.83                32.84                34.57
Tiling v4 2 outputs/thread    33.46                33.43                36.90
Tiling v5 4 outputs/thread    39.47                39.43                37.06

Results from the matrix-multiplication compute capability levels test on platform #4.

Appendix G – Report page count
The number of pages in this report is estimated from the following rule:
• 2400 characters constitute a normal page

Characters: this report holds in total 142,222 characters ~ 59.26 pages
