An Efficient, Model-Based CPU-GPU Heterogeneous FFT Library

Yasuhito Ogata 1,3, Toshio Endo 1,3, Naoya Maruyama 1,3, and Satoshi Matsuoka 1,2,3

1 Tokyo Institute of Technology
2 National Institute of Informatics
3 JST, CREST

ogata, endo@matsulab.is.titech.ac.jp
naoya.maruyama, matsu@is.titech.ac.jp
Abstract
General-Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC because of its high peak performance. However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications, its real performance is not necessarily higher than that of the current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. This is because the GPU performance can be severely limited by such restrictions as memory size and bandwidth and programming using graphics-specific APIs. To overcome this problem, we propose a model-based, adaptive library for 2D FFT that automatically achieves optimal performance using available heterogeneous CPU-GPU computing resources. To find optimal load distribution ratios between CPUs and GPUs, we construct a performance model that captures the respective contributions of CPU vs. GPU, and predicts the total execution time of 2D-FFT for arbitrary problem sizes and load distribution. The performance model divides the FFT computation into several small sub steps, and predicts the execution time of each step using profiling results. Preliminary evaluation with our prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement using both CPUs and GPUs can be as high as 50% compared to using either a CPU core or a GPU.
1 Introduction
General Purpose computing on Graphics Processing Units (GPGPU) is becoming popular in HPC. While the peak performance of a modern CPU is approximately 50 GFlops per die, the latest high-end GPUs boast over 500 GFlops of performance. Another advantage in using GPUs for HPC is their cost performance; due to commoditization, they are less expensive than dedicated hardware accelerators for numerical computation such as GRAPE [9] or ClearSpeed [3], while still achieving equal or sometimes even better performance.

However, in spite of the potential performance improvements as well as recent promising results in scientific computing applications ([6, 10] amongst many others), its real performance is not necessarily higher than that of the current high-performance CPUs, especially with recent trends towards increasing the number of cores on a single die. The reason for the lower real performance of GPUs is three-fold. First, raw GPU performance improvements can be sacrificed considerably by the overhead of memory transfer between CPUs and GPUs. Second, the size of GPU memory and its usage restrictions limit the maximum data size that can be computed by a single data transfer to a GPU, thus further magnifying the memory bandwidth limitation. Third, GPU peak performance is not always achievable with standard graphics APIs, such as DirectX or OpenGL. Although recent GPGPU programming environments such as CUDA [11] can reduce the overhead by providing more direct APIs for general scientific computing, the programming model is not yet straightforward for a standard programmer. More importantly, manually conducting proper load balancing under a heterogeneous GPU-CPU environment for arbitrary data sizes remains a difficult challenge.

The objective of this research is to realize an adaptive library for 2D FFT that automatically achieves optimal performance using available heterogeneous CPU-GPU computing resources. FFT is used for multitudes of applications such as image and signal processing and 3D protein analysis. An important research question thereof is how to find the optimal load distribution among such heterogeneous resources, since the execution time on each resource must be balanced under different CPU-GPU configurations, problem sizes, etc. Although there have been past attempts for heterogeneous clusters or grids with different CPU speeds [1], little is known for the combination of CPUs and GPUs.

This paper presents our 2D-FFT algorithm and its model-based optimal load distribution technique. The algorithm is based on the row-column method, and can exploit arbitrary combinations of heterogeneous CPUs and GPUs by distributing 1D FFT computation among them. To find optimal load distribution ratios, we construct a performance model that captures the respective contributions of CPU vs. GPU, and predicts the total execution time of 2D-FFT of a given matrix. The performance model divides the FFT computation into several small sub steps, and predicts the execution time of each step using profiling results on a particular CPU-GPU combination.

To evaluate the performance of our 2D FFT and the effectiveness of performance modeling in finding optimal load distribution, we have implemented a prototype CPU-GPU heterogeneous FFT library using existing 1D FFT libraries. Preliminary evaluation with the prototype shows that the performance model can predict the execution time of problem sizes that are 16 times as large as the profile runs with less than 20% error, and that the predicted optimal load distribution ratios have less than 1% error. We show that the resulting performance improvement using both CPUs and GPUs can be as high as 50% compared to using either CPUs or GPUs.

Authorized licensed use limited to: Hewlett-Packard via the HP Labs Research Library. Downloaded on May 25, 2009 at 05:15 from IEEE Xplore. Restrictions apply.
2 Proposed Model-Based 2D-FFT Library Using Both CPUs and GPUs
Our 2D-FFT library is based on the row-column method, which first executes column-order 1D FFT for all the columns of the input 2D matrix, and then row-order 1D FFT for all the rows. In each 1D FFT, the rows and columns are distributed among available computing resources: in this work, CPUs and GPUs. Figure 1 illustrates an instance of such distribution, where 65% of computation is allocated to a GPU, 25% to a CPU, and 10% to another CPU. Determining the optimal distribution ratio is an important research challenge for effective use of such heterogeneous hardware combinations; we show that our model-based approach can significantly automate this process in Section 3. In this section, we describe the underlying FFT libraries that we use for 1D FFT and present the detailed algorithm of our 2D FFT.
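The row-column decomposition described above can be sketched in plain Python. This is an illustrative toy, not the paper's implementation: the helper names (`dft_1d`, `split_rows`, `fft_2d_row_column`) are assumptions, and a naive O(n^2) DFT stands in for the FFTW/CUFFT library calls; only the structure (column pass, then a row pass whose rows are partitioned by per-resource ratios) mirrors the algorithm.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) 1D DFT, standing in for an FFTW/GPUFFTW/CUFFT call."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def split_rows(n_rows, ratios):
    """Partition row indices according to per-resource ratios (must sum to 1)."""
    bounds, acc = [0], 0.0
    for r in ratios:
        acc += r
        bounds.append(round(acc * n_rows))
    return [range(bounds[i], bounds[i + 1]) for i in range(len(ratios))]

def fft_2d_row_column(m, ratios=(0.65, 0.25, 0.10)):
    """2D FFT via the row-column method with a static work split."""
    n = len(m)
    # Phase 1: column-order 1D FFTs (gather each column, transform it).
    cols = [dft_1d([m[i][j] for i in range(n)]) for j in range(n)]
    inter = [[cols[j][i] for j in range(n)] for i in range(n)]
    # Phase 2: row-order 1D FFTs; each chunk models one resource (GPU or CPU).
    out = [None] * n
    for chunk in split_rows(n, ratios):
        for i in chunk:
            out[i] = dft_1d(inter[i])
    return out
```

The default ratios (65% / 25% / 10%) correspond to the distribution shown in Figure 1; in the real library each chunk would be handed to a different 1D-FFT backend rather than computed serially.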
2.1 Underlying FFT libraries
As 1D FFT implementations, the current prototype uses FFTW [5] for FFT on CPUs, and GPUFFTW [8] and CUFFT [11] on GPUs. FFTW is a high performance FFT library for CPUs developed by Frigo et al. It automatically selects an optimal algorithm by measuring the performance of pre-stage executions. GPUFFTW is an FFT library for GPUs developed by Govindaraju et al. A notable feature is an optimization based on the automatically-identified structure of graphics memory. NVIDIA's CUFFT is a 2D FFT library for GPUs implemented in the CUDA programming environment [11]. Although CUDA is limited to NVIDIA's GeForce 8000 and later, it allows programmers to write GPGPU programs in a C-like language without using graphics-specific APIs such as OpenGL. It also eliminates restrictions found in traditional GPGPU environments, such as read-only texture memory, thus achieving higher performance than those environments.

Figure 1. Example distribution of FFT computation to one GPU and two CPUs

Note that these libraries support multi-row 1D-FFT with a single library call, which is an important property for better performance on GPUs, since each library call is accompanied by CPU-GPU data communication. While the design of our prototype allows the use of 1D-FFT implementations other than GPUFFTW or CUFFT, alternative libraries should support multi-row 1D FFT for better performance.
2.2 Detailed Algorithm
Figure 2 illustrates the execution flow of our library. It executes column-order 1D-FFT first, and then row-order 1D-FFT. Here, we describe three technical considerations that are specific to our 2D FFT algorithm.
2.2.1 Data Format Conversion
We have to convert the data format according to the underlying 1D-FFT libraries. Specifically, FFTW and CUFFT require the row-major format, while GPUFFTW requires the column-major format. Thus, when we use CUFFT, we transpose the whole matrix before the first phase, as illustrated in Figure 2 (a). On the other hand, when we use GPUFFTW, we first transpose only the region computed by CPUs as in Figure 2 (b), and then the GPU region only at the last step of calculation. The current implementation performs all the transpose operations on CPUs; on dual-core machines, two threads are used for the transpose. Our next prototype will have the choice of transposing matrices on GPUs as well in order to reduce data communication between CPUs and GPUs.

Figure 2. Execution flows of the CUFFT-based and GPUFFTW-based libraries.
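The two conversion strategies can be sketched as follows. These helpers (`transpose`, `transpose_region`) are illustrative assumptions, not the paper's code: the first models the CUFFT path (convert the whole matrix up front), the second models the GPUFFTW path (convert only the CPU-assigned region).

```python
def transpose(m):
    """Whole-matrix transpose (CUFFT path: convert everything before phase 1)."""
    return [list(row) for row in zip(*m)]

def transpose_region(m, row_lo, row_hi):
    """Rows [row_lo, row_hi) of the transpose of m (GPUFFTW path:
    convert just the region the CPUs will compute on)."""
    n = len(m)
    return [[m[i][j] for i in range(n)] for j in range(row_lo, row_hi)]
```

Transposing only a region touches proportionally less memory, which is why the GPUFFTW flow in Figure 2 (b) defers the GPU region's conversion to the last step.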
2.2.2 Memory Size Limitation
Memory sizes of GPUs require special consideration, since typical GPUs have a smaller amount of memory than host memory; even high-end GPUs currently have less than a gigabyte of memory, while hosts typically have more than several gigabytes. Moreover, the size of data that can be placed on a GPU is even smaller when OpenGL is used, since read-only texture memory and the write-only framebuffer have to be allocated separately. Thus, we support larger computation by iterating GPU library calls so that each call runs successfully within a given memory size.
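The iteration scheme can be sketched as below. This is a hedged sketch, not the library's API: the function name, the assumption of separate input and output device buffers, and the callback shape are all illustrative.

```python
def run_in_chunks(rows, gpu_mem_bytes, bytes_per_row, fft_batch):
    """Call fft_batch on successive slices of `rows`, each slice sized so the
    data fits the GPU memory budget (assuming separate in/out buffers, as
    with OpenGL's read-only textures and write-only framebuffer)."""
    per_call = max(1, gpu_mem_bytes // (2 * bytes_per_row))  # in + out buffers
    out = []
    for lo in range(0, len(rows), per_call):
        # Each call models one CPU->GPU transfer, batched FFT, GPU->CPU transfer.
        out.extend(fft_batch(rows[lo:lo + per_call]))
    return out
```

Because every call pays a fixed transfer overhead, the library wants each chunk as large as the memory budget allows, which is also why multi-row batched 1D-FFT support matters (Section 2.1).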
2.2.3 CPU Load Balancing
We allocate different sizes of data to each thread on CPUs. The reason behind this is the load imbalance that can arise in GPGPU environments. Our preliminary experiments using CUDA version 0.8 on a dual-core CPU indicated that the GPU control thread mostly occupies one of the two cores. To run programs efficiently even on such load-imbalanced CPU cores, we need to adjust the data size depending on the load of each core.
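A minimal sketch of such load-aware splitting, under the assumption that each core's available throughput (rows per second) has been estimated beforehand; the function name and the proportional-split policy are illustrative, not the paper's stated method.

```python
def rows_per_thread(n_rows, throughputs):
    """Split n_rows across CPU threads proportionally to per-core throughput
    estimates, so a core burdened by the GPU control thread gets less work."""
    total = sum(throughputs)
    counts = [int(n_rows * t / total) for t in throughputs]
    counts[counts.index(max(counts))] += n_rows - sum(counts)  # absorb rounding
    return counts
```

For example, a free core estimated at three times the throughput of the core shared with the GPU control thread would receive three quarters of the CPU-side rows.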
Table 1. Model Parameters

Param   Description
tr      Matrix transposition on CPU
gMa     Memory allocation on GPU
c2g     Data transmission from CPU to GPU
g2c     Data transmission from GPU to CPU
gP      Pre-processing of 1D-FFT on GPU
gDP     Post-processing of 1D-FFT on GPU
cFFT    1D-FFT on CPU
gFFT    1D-FFT on GPU
gMF     Memory releasing on GPU
3 Model-Based Optimal Load Balancing
We find optimal data distribution ratios by deriving a performance model of our FFT library. The model gives a predicted execution time parameterized with the size of the input matrix, n × n, and the distribution ratio assigned to a GPU. For a given 2D FFT of size n, we set this ratio to the value that the model predicts gives the shortest execution time. The rest of this section describes the model of the CUFFT version of our library; because the model for the GPUFFTW version is nearly identical to that for the CUFFT version, we do not discuss it here for the sake of brevity. In addition, the model below assumes that only one thread is used for CPU-side computation; we have observed that whether using a single thread or two does not change the best performance on the CPU-GPU configuration used in our experiments.

To derive the performance model, we divide the entire FFT computation into several sub steps, whose execution time is represented with the model parameters shown in Table 1. These parameters denote the primitive performance of the underlying FFT libraries and hardware. For example, we model the CPU-to-GPU data transfer time as c2g multiplied by the amount of data assigned to the GPU (the distribution ratio times n²), because we estimate that it increases linearly with the data size. We determine the exact value of each model parameter through several preamble executions for a given specific hardware configuration.

Figure 3 illustrates the execution steps of the CUFFT version of our library. We have identified nine steps that heavily affect total performance, such as 1D-FFT, transposition, and data transfer. When a matrix size is larger than the available GPU memory, the library divides the matrix into sub matrices and iterates the steps for each sub matrix.

Using the model parameters, we first define the performance models of row- and column-major 1D FFT for CPUs and GPUs. We define the CPU models for row- and column-