
Thang M. Le

Abstract

Equipped with hundreds of small cores, the graphics processing unit (GPU) has drawn broad attention for its massively parallel computing capability. The GPU was originally designed to excel at graphics applications. Under this design, most computations are performed natively at the hardware layer, exposing very limited programming APIs. With the advent of general-purpose computing on GPUs (GPGPU), where GPUs take over more computationally intensive tasks from CPUs, GPU design has moved toward letting applications interface with the hardware through richer programming APIs. The past few years have witnessed a number of general-purpose applications use these APIs to tap GPU resources and boost their performance by orders of magnitude. Endorsing this movement, I report a case study on accelerating graph algorithms on the GPU. The study was inspired by the paper "Accelerating Large Graph Algorithms on the GPU Using CUDA" by Pawan Harish and P. J. Narayanan, in which the authors discuss several common graph problems and propose corresponding parallel algorithms suitable for the GPU. This report focuses on the single-source shortest path (SSSP) problem. To evaluate the performance advantage of the GPU over the CPU effectively, I implemented two algorithms: the parallel SSSP algorithm described in the paper, and Dijkstra's algorithm, the classic solution to the SSSP problem in the literature. Both implementations were written mainly in C, with NVIDIA's CUDA parallel programming language used to leverage the GPU for the parallel SSSP algorithm. The results show that on large graphs, the parallel SSSP algorithm outperforms Dijkstra's algorithm by a factor of several hundred.

The Parallel SSSP Algorithm

The algorithm was designed to exploit the massive data parallelism of a graph by assigning one thread to each vertex. This means the larger the input graph, the higher the degree of parallelism the algorithm can achieve. The parallelism is, however, bounded by the number of concurrent threads supported by the device. The Tesla C2050 has 14 streaming multiprocessors, each supporting a maximum of 1536 concurrent threads; in theory, this yields as many as 21504 concurrent threads. Implementing the parallel SSSP algorithm is about as easy as implementing Dijkstra's algorithm. The original algorithm only computes shortest distances from a single source. I made a small change to additionally retrieve the shortest paths from the source, which requires a new kernel method to keep track of each vertex's parent. As a result, there are three kernel methods to be executed on the GPU (a.k.a. the device).

Kernel method 1:

```c
/* Kernel 1: for every masked (frontier) vertex, relax the tentative
 * cost Ua of each outgoing neighbor; atomicMin resolves concurrent
 * updates to the same neighbor. */
__global__ void cuda_sssp_kernel1(int *Va, int *Ea, int *Wa, int *Ma,
                                  int *Ca, int *Ua, int *Pa, int *empty,
                                  int vertices, int nedges) {
    int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    int i, start_pos, end_pos;
    if (tid < vertices) {
        if (Ma[tid] == 1) {
            start_pos = Va[tid];
            end_pos = start_pos;
            if (tid < vertices - 1) {
                if (Va[tid] < Va[tid + 1]) {
                    end_pos = Va[tid + 1];
                }
            } else {
                end_pos = nedges;
            }
            for (i = start_pos; i < end_pos; i++) {
                if (Ua[Ea[i]] > Ca[tid] + Wa[i]) {
                    atomicMin(&Ua[Ea[i]], Ca[tid] + Wa[i]);
                }
            }
        }
    }
}
```

Kernel method 2:

```c
/* Kernel 2: the same traversal, run after kernel 1 to avoid a race:
 * record this vertex as the parent of every neighbor whose settled
 * tentative cost matches this relaxation, and clear the mask. */
__global__ void cuda_sssp_kernel2(int *Va, int *Ea, int *Wa, int *Ma,
                                  int *Ca, int *Ua, int *Pa, int *empty,
                                  int vertices, int nedges) {
    int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    int i, start_pos, end_pos;
    if (tid < vertices) {
        if (Ma[tid] == 1) {
            Ma[tid] = 0;
            *empty = 1;
            start_pos = Va[tid];
            end_pos = start_pos;
            if (tid < vertices - 1) {
                if (Va[tid] < Va[tid + 1]) {
                    end_pos = Va[tid + 1];
                }
            } else {
                end_pos = nedges;
            }
            for (i = start_pos; i < end_pos; i++) {
                if (Ua[Ea[i]] == Ca[tid] + Wa[i]) {
                    Pa[Ea[i]] = tid;
                }
            }
        }
    }
}
```

Kernel method 3:

```c
/* Kernel 3: commit improved tentative costs from Ua into Ca, re-mask
 * the updated vertices, and flag that another iteration is needed. */
__global__ void cuda_sssp_kernel3(int *Va, int *Ea, int *Wa, int *Ma,
                                  int *Ca, int *Ua, int *Pa, int *empty,
                                  int vertices, int nedges) {
    int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
    if (tid < vertices) {
        if (Ca[tid] > Ua[tid]) {
            Ca[tid] = Ua[tid];
            Ma[tid] = 1;
            *empty = 0;
        }
        Ua[tid] = Ca[tid];
    }
}
```

As you might have already noticed, the logic of kernel method 2 is almost the same as that of kernel method 1. The reason for keeping them separate is to prevent race conditions when running on the GPU. In addition, readers should be advised that the above implementation may produce cycles among parents (i.e., the parent of vertex A is B and the parent of vertex B is A) if there are edges with weight 0. Testing the implementation of the parallel SSSP algorithm to ensure it always produces the expected results is important: achieving a high degree of parallelism comes with the risk of race conditions. In particular, the algorithm is prone to concurrency issues in kernel method 1, where many GPU threads may try to update the same value in global memory simultaneously. To verify the consistency of the output, I embedded verification logic in the program that recomputes the distance of a shortest path in the calculate_shortest_path_distance() method and compares it with the shortest distance reported by the algorithm. Another effort was testing the program on small graphs and comparing the results with those of Dijkstra's algorithm. For large graphs, I used the diff utility on Linux to compare the outputs of the two algorithms.

Performance Results

The performance evaluation was done on a Tesla server equipped with two Intel Xeon Quad Core E5620 2.4 GHz CPUs, four NVIDIA Tesla C2050 GPUs, 24 GB of 1333 MHz ECC DDR3 memory, and two 600 GB Seagate Cheetah 15000 rpm disks (16 MB cache) behind a RAID controller. The testing environment is supervised by Professor Edward Chan at the University of Waterloo. The test data is the road map of the State of Connecticut, containing 161595 nodes and 1306818 edges. Even though the server has multiple GPUs, the parallel SSSP algorithm was implemented to run on a single GPU only; distributing tasks across multiple GPUs is left as future work. The table below shows the performance benchmark of the two algorithms.

| #Vertices | #Edges | Max out-degree | Min out-degree | Avg out-degree | Dijkstra iterations | Dijkstra runtime (s) | SSSP iterations | SSSP runtime (s) |
|----------:|-------:|---:|---:|---:|---:|---:|---:|---:|
| 1181 | 5596 | 10 | 2 | 4.738 | 1181 | 0.0140 | 60 | 0.00322 |
| 4165 | 19348 | 10 | 2 | 4.645 | 4165 | 0.130 | 102 | 0.00571 |
| 8146 | 40792 | 12 | 2 | 5.007 | 8146 | 0.524 | 194 | 0.0112 |
| 15001 | 76692 | 12 | 2 | 5.112 | 15001 | 1.660 | 199 | 0.0158 |
| 95581 | 748014 | 130 | 2 | 7.825 | 94972 | 64.188 | 636 | 0.138 |
| 161595 | 1306818 | 130 | 2 | 8.086 | 160963 | 176.851 | 605 | 0.223 |

Figure 1. Performance Benchmark - The Parallel SSSP Algorithm vs Dijkstra's Algorithm

Accordingly, on the large graphs, the parallel SSSP algorithm on the GPU runs 500 to 700 times faster than Dijkstra's algorithm on the CPU.

| #Vertices | #Edges | Kernel 1 runtime (max/avg/min) (s) | Kernel 2 runtime (max/avg/min) (s) | Kernel 3 runtime (max/avg/min) (s) |
|----------:|-------:|---:|---:|---:|
| 1181 | 5596 | 0.00003/0.00001/0.00001 | 0.00002/0.00001/0.00001 | 0.00001/0.00001/0.00001 |
| 4165 | 19348 | 0.00015/0.00002/0.00001 | 0.00002/0.00001/0.00001 | 0.00001/0.00001/0.00001 |
| 8146 | 40792 | 0.00003/0.00002/0.00001 | 0.00003/0.00002/0.00001 | 0.00001/0.00001/0.00001 |
| 15001 | 76692 | 0.00005/0.00003/0.00001 | 0.00006/0.00002/0.00001 | 0.00002/0.00002/0.00002 |
| 95581 | 748014 | 0.00022/0.00009/0.00001 | 0.00023/0.00009/0.00001 | 0.00003/0.00002/0.00002 |
| 161595 | 1306818 | 0.00034/0.00016/0.00002 | 0.00034/0.00016/0.00002 | 0.00003/0.00003/0.00002 |

Figure 2. Kernel Method Runtime

From Figure 2, we can see that kernel methods 1 and 2 have similar running times, and their performance degrades as the input graph grows. Kernel method 3 performs consistently across different input graphs. Future work would be to reduce the running time of kernel methods 1 and 2.

Conclusion

This study reiterates the importance of parallelism in computing. Exploiting data parallelism yields a significant performance improvement, and the degree of parallelism available in input data such as large graphs is enormous. For this type of computation, multithreading on the CPU fails to achieve satisfactory results: CPU threads are costly due to context-switch overhead, and developers are required to fork and join threads explicitly in the program. In contrast, GPU computing is a platform well suited to data-parallel processing. A GPU containing hundreds of small cores can be viewed as a small cluster of CPU machines: context switches on the GPU carry virtually no overhead, asynchronous concurrent execution is natively supported by the hardware, and little effort is needed to manage threads in the program. With all of these advantages, GPU computing is suitable for data-intensive tasks. This is reminiscent of the MapReduce framework, which is also used for processing data-intensive tasks. More and more companies are deploying MapReduce clusters of thousands of nodes to keep up with the exponential growth of their data. Using GPU computing in MapReduce clusters could significantly reduce the size of the clusters while still achieving better throughput.

