Figure 2. Execution flows of the CUFFT-based and GPUFFTW-based libraries.
matrix before the first phase, as illustrated in Figure 2 (a). On the other hand, when we use GPUFFTW, we first transpose only the region computed by CPUs as in Figure 2 (b), and then the GPU region only at the last step of the calculation. The current implementation performs all the transpose operations on CPUs; on dual-core machines, two threads are used for the transpose. Our next prototype will also offer the choice of transposing matrices on GPUs in order to reduce data communication between CPUs and GPUs.
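The two-thread CPU transpose described above can be sketched as follows; this is our own illustration using pthreads, not the library's code, and the names (transpose_worker, transpose_2threads) are hypothetical.

```c
#include <pthread.h>

/* Sketch: transpose an n x n matrix with two CPU threads, each handling
 * half of the rows, as the text describes for dual-core machines. */
typedef struct {
    const float *src;
    float *dst;
    int n, row_begin, row_end;
} TransposeTask;

static void *transpose_worker(void *arg) {
    TransposeTask *t = (TransposeTask *)arg;
    for (int i = t->row_begin; i < t->row_end; i++)
        for (int j = 0; j < t->n; j++)
            t->dst[(long)j * t->n + i] = t->src[(long)i * t->n + j];
    return NULL;
}

void transpose_2threads(const float *src, float *dst, int n) {
    pthread_t th;
    TransposeTask a = { src, dst, n, 0, n / 2 };  /* first half: worker thread */
    TransposeTask b = { src, dst, n, n / 2, n };  /* second half: calling thread */
    pthread_create(&th, NULL, transpose_worker, &a);
    transpose_worker(&b);
    pthread_join(th, NULL);
}
```

The two threads write disjoint column ranges of the destination, so no synchronization beyond the final join is needed.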
2.2.2 Memory Size Limitation
Memory sizes of GPUs require special consideration, since typical GPUs have a smaller amount of memory than the host: even high-end GPUs currently have less than a gigabyte of memory, while hosts typically have several gigabytes. Moreover, the size of data that can be placed on a GPU is even smaller when OpenGL is used, since the read-only texture memory and the write-only framebuffer have to be allocated separately. Thus, we support larger computations by iterating GPU library calls so that each call runs successfully within the given memory size.
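The chunking arithmetic behind this iteration scheme can be sketched as below; the function names and the factor of two for the separate OpenGL input/output allocations are our own reading of the text, not code from the paper.

```c
#include <stddef.h>

/* How many matrix rows fit in one GPU library call. With OpenGL, the
 * read-only texture and write-only framebuffer are allocated separately,
 * so each row effectively costs two copies of its footprint. */
int rows_per_chunk(size_t gpu_mem_bytes, size_t bytes_per_row) {
    size_t per_row = 2 * bytes_per_row;
    size_t r = gpu_mem_bytes / per_row;
    return r > 0 ? (int)r : 0;
}

/* Number of GPU library calls needed to cover the whole matrix. */
int num_chunks(int n_rows, int chunk_rows) {
    if (chunk_rows <= 0) return -1;                 /* a single row does not fit */
    return (n_rows + chunk_rows - 1) / chunk_rows;  /* ceiling division */
}
```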
2.2.3 CPU Load Balancing
We allocate different sizes of data to each thread on CPUs. The reason behind this is the load imbalance that can arise in GPGPU environments. Our preliminary experiments using CUDA version 0.8 on a dual-core CPU indicated that the GPU control thread mostly occupies one of the two cores. To run programs efficiently even on such load-imbalanced CPU cores, we need to adjust the data size depending on the load of each core.
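One way to realize such an adjustment is to split the rows in proportion to each core's free capacity; this proportional formula is our own sketch under that assumption, not the paper's stated policy.

```c
/* Split n_rows between two CPU worker threads. Core 0 is fully
 * available (weight 1.0); core 1 also hosts the GPU control thread and
 * only a fraction avail_core1 of it is free for FFT work. */
void split_rows(int n_rows, double avail_core1, int *rows0, int *rows1) {
    double total = 1.0 + avail_core1;
    *rows1 = (int)(n_rows * (avail_core1 / total) + 0.5);  /* round to nearest */
    *rows0 = n_rows - *rows1;
}
```

With avail_core1 = 1.0 the split is even; as the GPU control thread consumes more of core 1, its worker receives proportionally fewer rows.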
Table 1. Model Parameters
Matrix transposition on CPU
Memory allocation on GPU
Data transmission from CPU to GPU
Data transmission from GPU to CPU
Pre-processing of 1D-FFT on GPU
Post-processing of 1D-FFT on GPU
1D-FFT on CPU
1D-FFT on GPU
Memory releasing on GPU
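The parameter symbols from Table 1 were lost in this extraction, so the struct below uses placeholder field names of our own choosing; it only illustrates how the nine per-step costs could be held together for the model.

```c
/* One cost parameter per sub-step listed in Table 1. Field names are
 * our placeholders, not the paper's symbols. */
typedef struct {
    double t_transpose_cpu;   /* matrix transposition on CPU       */
    double t_alloc_gpu;       /* memory allocation on GPU          */
    double t_send;            /* data transmission from CPU to GPU */
    double t_recv;            /* data transmission from GPU to CPU */
    double t_pre_gpu;         /* pre-processing of 1D-FFT on GPU   */
    double t_post_gpu;        /* post-processing of 1D-FFT on GPU  */
    double t_fft_cpu;         /* 1D-FFT on CPU                     */
    double t_fft_gpu;         /* 1D-FFT on GPU                     */
    double t_free_gpu;        /* memory releasing on GPU           */
} ModelParams;
```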
3 Model-Based Optimal Load Balancing
We find optimal data distribution ratios by deriving a performance model of our FFT library. The model gives a predicted execution time parameterized with the size of the input matrix and the distribution ratio to a GPU. For a given 2D FFT, we set the ratio to the value that the model predicts gives the shortest execution time. The rest of this section describes the model of the CUFFT version of our library; because the model for the GPUFFTW version is nearly identical to that for the CUFFT version, we do not discuss it here for the sake of brevity. In addition, the model below assumes that only one thread is used for CPU-side computation; we have observed that using one thread or two does not change the best performance on the CPU-GPU configuration used in our experiments.

To derive the performance model, we divide the entire FFT computation into several sub-steps, whose execution times are represented with the parameters shown in Table 1. These parameters denote the primitive performance of the underlying FFT libraries and hardware. For example, we model the CPU-to-GPU data transfer time as a linear function of the data size. We determine the exact value of each model parameter through several preamble executions for a given specific hardware configuration.

Figure 3 illustrates the execution steps of the CUFFT version of our library. We have identified nine steps that heavily affect total performance, such as 1D-FFT, transposition, and data transfer. When a matrix size is larger than the available GPU memory, the library divides the matrix into sub-matrices and iterates the steps for each sub-matrix. Using the model parameters, we first define the performance models of row- and column-major 1D FFT for CPUs and GPUs. We define the CPU models for row- and column-
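The calibration of a linear parameter such as the transfer cost can be sketched as an ordinary least-squares fit over a few timed preamble runs; the paper does not give this code, so the function below is our illustration of the idea.

```c
/* Fit time = a + b * size by ordinary least squares from n timed
 * "preamble" executions, as the text describes for determining each
 * model parameter on a given hardware configuration. */
void fit_linear(const double *size, const double *time, int n,
                double *a, double *b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += size[i];
        sy  += time[i];
        sxx += size[i] * size[i];
        sxy += size[i] * time[i];
    }
    double denom = n * sxx - sx * sx;
    *b = (n * sxy - sx * sy) / denom;  /* per-byte cost (slope)     */
    *a = (sy - *b * sx) / n;           /* fixed overhead (intercept) */
}
```

The slope plays the role of the per-byte transfer cost and the intercept captures fixed call overhead; the same fit applies to any of the Table 1 parameters that scale linearly with data size.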
Authorized licensed use limited to: Hewlett-Packard via the HP Labs Research Library. Downloaded on May 25, 2009 at 05:15 from IEEE Xplore. Restrictions apply.