
PROJECT REPORT

APPLIED PARALLEL PROGRAMMING

Advanced MRI Reconstruction

Final Project Report


Abdallah Abu Ghazaleh
Gokul Subramani
Gurudutt Ramaprasad
Karthik Manivannan
Sridhar Panneer selvam

INTRODUCTION
Modern GPUs are becoming an increasingly attractive platform for speeding up applications,
ranging from black hole simulations to DNA sequencing, and the number of applications
accelerated with CUDA keeps growing. To learn and experience the effectiveness of CUDA, we
took up the challenge of speeding up the computations involved in advanced Magnetic
Resonance Imaging (MRI) reconstruction.
In this report, we describe an approach to accelerate an MRI image reconstruction algorithm
using CUDA C on a GTX 480 Graphics Processing Unit (GPU). The raw data captured by the
coils, the k-space data, needs to be processed before the doctor can interpret the results of
the MRI scan. This processing takes a few hours for the best algorithm implemented on a CPU.
We observed that the algorithm contains portions that exhibit high parallelism, and these
data-parallel computations can be performed at a much faster rate on the GPU. We identified
the bottlenecks of the MRI image reconstruction algorithm and implemented those functions in
CUDA C. This report focuses on the approach and the techniques used to convert the Matlab
version of the algorithm into a highly parallel CUDA C version.

MOTIVATION
The motivation for this project stems from the fact that an MRI scan can be a long and
uncomfortable experience for patients, who must lie still in the machine for close to an
hour. This project, carried out with the guidance of Prof. Dr. John Sartori and Prof. Mehmet
Akcakaya, aims at reducing the time required for the scan and the computations involved by
utilizing the GPU. Dr. Mehmet Akcakaya is currently exploring an algorithm called LOST (LOw
dimensional-structure Self learning and Thresholding), which adaptively finds a sparse
representation of the given image using the features of the image itself rather than a
pre-determined fixed transform domain.
The algorithm consists of two phases: one de-noises the image for accuracy, and the other
performs the Fourier transform for reconstruction.
We proposed that the scope for parallelism is large in the computation of the FFT and IFFT,
which involve a 4D input as large as 256*232*88*32. We planned to use cuFFT, the CUDA
library specialised for FFTs, to parallelise the transforms.

DESIGN OVERVIEW:
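The discrete Fourier transform maps a complex-valued vector x_0, ..., x_{N-1} into its
frequency-domain representation X_0, ..., X_{N-1}, given by
X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i k n / N}, \quad k = 0, \dots, N-1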

Figure 1. MRI Reconstruction overview


The cuFFT library uses algorithms with O(N log N) complexity for an input of size N. Of the
various transform types, we used the C2C (complex input to complex output) FFT. The code was
run on a GTX 480 GPU, whose device capabilities are:

Compute capability: 2.0
Microarchitecture: Fermi
Maximum number of threads per block: 1024
Maximum amount of shared memory: 48 kB
Constant memory size: 64 kB
Global memory: 1.46 GB DRAM
Pitched memory: 2.0 GB

cuFFT vs FFTW: A study from the University of Waterloo [6] found that cuFFT is better suited
to larger FFTs. cuFFT starts to outperform FFTW at data sizes of around 8192 elements. It is
therefore safe to assume that cuFFT performs better than FFTW for input sizes above roughly
10,000 elements.
PROFILE SUMMARY of Matlab Code:

Figure 2. Profile Summary

Of all the functions in the Matlab code provided to us, recons_cs, data_consistency_3d and
RecPF_denoise together consumed 72% of the run time (548 s out of 752 s). It is therefore
sensible to convert only those functions to CUDA C.
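The profile also bounds what GPU acceleration of these functions can deliver for the whole program. The sketch below applies Amdahl's law to the profiled fraction. The 548 s and 752 s figures come from the profile; the function names and any speed-up factor passed in are illustrative assumptions, not measurements from this project.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a fraction p of the run time
   is accelerated by a factor s and the remainder is left unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound on the overall speed-up: the accelerated fraction
   is assumed to take no time at all. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With p = 548.0 / 752.0, amdahl_limit(p) is about 3.7, so the end-to-end run time can improve by at most that factor unless the remaining functions are also accelerated.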

FLOW CHART:
Step 1: Input the image file (256*232*88*32) into the GPU's global memory.
Step 2: Find the centre dimensions to determine the critical image area and filter the image using the Tukeywin filter.
Step 3: Perform the inverse fast Fourier transform for all coils and sum them to 3D size.
Step 4: Denoise the image using the traditional RecPF-denoise method.
Step 5: Repeat the de-noise process for 25 iterations.

Figure 3. Flow chart


IMPLEMENTATION: The basics
We read our inputs from an ASCII file and store the data in cuComplex structs. cuComplex is
a basic struct with two member variables, x and y, representing the real and imaginary parts
of the number respectively. Any computation on cuComplex numbers must be carried out using
the functions in the cuComplex.h header (the operators are not overloaded to perform
addition, multiplication, etc.).
We also use the cuFFT library, which is part of the CUDA toolkit. cuFFT takes cuComplex as
its input data structure for the complex FFT computations used here and provides both
forward and inverse Fourier transforms. We had to implement the FFT shift and the inverse
FFT shift ourselves.
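Because the arithmetic operators are not overloaded, every complex operation is an explicit function call. The sketch below is plain host-side C that shows the arithmetic such calls perform. The struct complex_f and the two functions are hypothetical stand-ins that mirror the x/y layout described above and the roles of cuCaddf and cuCmulf in cuComplex.h; they are not the CUDA definitions themselves.

```c
/* Hypothetical stand-in for cuComplex: x is the real part, y the imaginary part. */
typedef struct { float x; float y; } complex_f;

/* (a.x + i*a.y) + (b.x + i*b.y): the role cuCaddf plays in cuComplex.h. */
complex_f complex_add(complex_f a, complex_f b)
{
    complex_f r = { a.x + b.x, a.y + b.y };
    return r;
}

/* (a.x + i*a.y) * (b.x + i*b.y): the role cuCmulf plays in cuComplex.h. */
complex_f complex_mul(complex_f a, complex_f b)
{
    complex_f r = { a.x * b.x - a.y * b.y,
                    a.x * b.y + a.y * b.x };
    return r;
}
```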

The original 3D k-space data obtained from the input is squeezed along the X dimension to
form a 2D image, in which the most critical data lie at the centre. The critical data must
be selected from the input image through Y-start, Y-end, Z-start and Z-end. Initially, the
start and end points in both dimensions point to the centre-most pixel of the image. We then
determine a rectangle around this point that contains all the valid data.
Figure 4. Centre dimension
This rectangle now contains the key image data that needs to be filtered and enhanced the
most, compared to the other pixels in the image. To preserve the centre data and reduce the
noise of the surrounding elements, we filter the image using a Tukey window (tukeywin)
filter. We then sum the filtered image over all 32 coils to get a 3D input image of size
256x232x88.
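The coil summation that turns the 4D data into a single 3D volume can be sketched on the host as follows. This is a reference sketch only: the coil-after-coil memory layout, the struct complex_f and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* Sum n_coils volumes of `voxels` elements each into one volume.
   Assumed layout: coil after coil, i.e. coils[c * voxels + v]. */
void sum_over_coils(const complex_f *coils, size_t n_coils, size_t voxels,
                    complex_f *out)
{
    assert(coils != NULL && out != NULL);
    for (size_t v = 0; v < voxels; ++v) {
        float re = 0.0f, im = 0.0f;
        for (size_t c = 0; c < n_coils; ++c) {
            re += coils[c * voxels + v].x;
            im += coils[c * voxels + v].y;
        }
        out[v].x = re;
        out[v].y = im;
    }
}
```

For the data in this report, n_coils would be 32 and voxels would be 256*232*88. Each output voxel is independent of the others, which is what makes this step a natural fit for one GPU thread per voxel.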
The filtered 3D image is sent to the de-noising function, RecPF_denoise, which produces an
image estimate. The estimated 3D image is then compared with the original 4D image for all
32 coils; the essential, i.e. critical, part of the image is preserved while the noise is
cancelled out. We repeat this process 25 times to eliminate the noise from the image, as
shown in figure 1.
As GPU engineers, we are interested in the parallel part of the algorithm, so we take the
original image and the filtered image as our input. Figure 2 shows that the functions
RecPF_denoise and dataconsistency3d take much longer than the other functions, so these are
the two we implement on the GPU.

IMPLEMENTATION: RecPF Denoise


The second part of the code that could exploit parallelism was the RecPF_denoise function.
We spent a total of about 175 seconds (roughly 3 minutes) processing data in that function.
The function is called 24 times (one fewer than the 3D data consistency part), so each call
takes about 7.3 seconds. Our goal is to beat that number.
The RecPF_denoise Matlab code contains many computations and library calls to both the image
processing toolbox and the signal processing toolbox. There are numerous FFT (Fourier
transform) computations and numerous finite difference computations, along with manipulation
of the input data. The input data is the single image produced after each cycle of the main
reconstruction algorithm (mainly the 3D data consistency part).
RecPF_denoise first initializes the constants for its finite difference method in constant
memory and copies them to the device. It also places the optimization parameters in constant
memory. The values in constant memory are retained across kernel calls. RecPF_denoise then
calculates the initial parameters for the calculation loop, a denominator and a numerator,
both of which require an FFT.
Following those initializations, we start our main loop, which is divided into a few parts.
We would have liked to create a single kernel for the entire processing of the data, but
because we need to call cuFFT on our data throughout, we must divide the calculations into
parts and save the current state in global memory so that we can resume with the cuFFT
outputs. The loop runs three times and solves the three big parts of the MATLAB code: the W
sub-problem, the U sub-problem and, finally, the Bregman update of the values for the next
run. We wrote a brute-force version of the code first and then started optimizing.
For some of the computations, we reordered the variables in order to minimize the number of
variables used in the brute-force approach (and in the Matlab code). Removing the extra
variables reduced memory consumption and improved performance. Additionally, parts of the
Matlab code were not very efficient, for example data that was calculated but never used.
We removed those computations as well.
For the Fourier transform code, the implementation was fairly straightforward. For the
finite difference methods, we used code published on the NVIDIA blog [5] and modified it to
run without a fixed input size and with cuComplex data. The code uses shared memory, and we
had to experiment with the shared memory size because, with our input size, the compiler
initially reported that we were allocating more shared memory than the device could provide.
Combining the code proved a little challenging (we started running into problems with the
finite differences when we tried to put them all in one kernel call), so we abandoned that
idea, especially since there were problems with the verification of the RecPF denoise part.
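A host-side reference for a finite difference on complex data is useful for checking a kernel's output element by element. The sketch below assumes a first-order forward difference with periodic wrap-around along one line of samples. The report does not state which stencil the kernels use, so the stencil, the struct complex_f and the function name are illustrative assumptions only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* Forward difference along a line of n samples with periodic wrap-around:
   d[i] = u[(i + 1) % n] - u[i], applied to real and imaginary parts. */
void forward_diff_periodic(const complex_f *u, size_t n, complex_f *d)
{
    assert(n > 0 && u != d);  /* not an in-place routine */
    for (size_t i = 0; i < n; ++i) {
        size_t next = (i + 1) % n;
        d[i].x = u[next].x - u[i].x;
        d[i].y = u[next].y - u[i].y;
    }
}
```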

VERIFICATION: RecPF Denoise


For the RecPF Denoise part, we compared our output against the expected results (produced by
the Matlab code) and observed that our output was entirely NaN (a problem we also
encountered when initially running the Matlab code with wrong configuration parameters).
This led to an investigation of which parts were at fault. Starting from the beginning, we
noticed that the results for the numerator and denominator are not the same as those from
Matlab.
At the moment, the method produces nan + j nan for all of its outputs. For comparison, the
first two expected elements are -2.483373e-10+j-1.389864e-09 and
1.088936e-10+j6.383455e-10. Further work should go into finding the cause of this problem.
We ran out of time to continue to triage it.
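One way to continue the triage is to copy each intermediate buffer back to the host and locate the first NaN, which narrows down the stage that introduces it. The helper below is a minimal sketch of that check; the struct complex_f and the function name are hypothetical and not part of the project code.

```c
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* Return the index of the first element whose real or imaginary part is NaN,
   or -1 if there is none. Uses the fact that NaN is the only value that
   compares unequal to itself. */
long first_nan(const complex_f *data, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (data[i].x != data[i].x || data[i].y != data[i].y)
            return (long)i;
    }
    return -1;
}
```

Running this after the numerator and denominator are computed, and again after each stage of the loop, would show whether the NaNs are already present before the first division.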

PERFORMANCE
Initially we implemented a CUDA program for the RecPF_denoise function that took 40 seconds
over 10 iterations. We noticed later that our calculation of the average was incorrect (we
weren't dividing by 10); the correct figure is about 4 seconds per iteration of the loop.
This is less than 7 seconds, so we were getting some performance improvement (though not
much yet). After further analysis, we discovered that our nvcc build was in debug mode.
After creating a new Makefile and rerunning the kernel, we achieved a large performance
increase: the run time is now less than half a second. That is about a 14x improvement (not
100x yet, but a great improvement).
We tried changing the block size of the finite difference part but saw little performance
gain from changing the sizes (we were already using a long stencil in shared memory, which
was already optimized).
The memory cost of RecPF_denoise is large: besides the input and output data, we allocate 10
other matrices to store our data, for a total cost of 256*232*88*8*(10+2) = 0.501743616
gigabytes. It is a little hard to determine whether we are bound by memory or by computation
in our case.
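The memory figure can be reproduced with a few lines of arithmetic. The sketch below assumes 8 bytes per element (two 4-byte floats, as in cuComplex); the function name is illustrative.

```c
/* Bytes needed for `volumes` single-precision complex volumes of
   nx*ny*nz elements, at 8 bytes per element (two 4-byte floats). */
unsigned long long volume_bytes(unsigned long long nx, unsigned long long ny,
                                unsigned long long nz,
                                unsigned long long volumes)
{
    const unsigned long long bytes_per_element = 8ULL;
    return nx * ny * nz * bytes_per_element * volumes;
}
```

volume_bytes(256, 232, 88, 12) gives 501,743,616 bytes, the 0.50 GB quoted above. A single volume is 41,811,968 bytes, or 39.875 MB when divided by 1024*1024, which matches the smallest input size listed in Table 2.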

IMPLEMENTATION: dataconsistency3d
The source Matlab code for the dataconsistency3d function is shown below:
sig_check = fftshift(fftn(ifftshift(img, 1)));
sig_check(picks) = sig(picks);
z = fftshift(ifftn(ifftshift(sig_check)), 1);

The main goal of this function is to retain the critical part of the image. During the RecPF
de-noise process, all the image data are de-noised. picks holds the indices of all critical
parts of the data, and sig is the original 4D data. The function performs an FFT, replaces
the critical part with the original data, performs an IFFT and sends the result back to the
RecPF de-noise process. This process is repeated 25 times. Since many 3D FFTs and IFFTs are
performed, a CUDA implementation will speed up the process.
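The replacement step in the middle of the function, sig_check(picks) = sig(picks), can be sketched on the host as below. Indices are 0-based here, unlike Matlab; the struct complex_f and the function name are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* For every index in picks, overwrite the estimate with the acquired sample.
   Mirrors the Matlab line sig_check(picks) = sig(picks), with 0-based indices. */
void restore_acquired(complex_f *sig_check, const complex_f *sig,
                      const size_t *picks, size_t n_picks)
{
    for (size_t l = 0; l < n_picks; ++l) {
        size_t idx = picks[l];
        sig_check[idx] = sig[idx];
    }
}
```

Each pick touches a different element, so the loop iterations are independent and the step could equally run as one GPU thread per pick, which would avoid a round trip through host memory.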

The first step is the conversion of Matlab functions such as fftshift and ifftshift into C
code. We implemented kernels for fftshift and ifftshift. The following table compares the
time taken by the fftshift and ifftshift functions on the CPU and on the GPU. The input size
is the same as the 3D matrix size, 256x232x88.
Table 1. FFT and IFFT shift function CPU vs GPU
Function              Time taken to execute in CPU   Time taken to execute in GPU   Performance gain
Fftshift function     12.5 ms                        0.51 ms                        24.5x
Ifftshift function    12.945 ms                      0.753 ms                       17.19x
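For reference, the index mapping behind the two shifts along a single dimension can be written on the host as below. This is a sketch of the standard fftshift and ifftshift definitions, not the CPU code that was timed in Table 1; the struct complex_f and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* fftshift: moves the zero-frequency sample (index 0) to the centre, index n/2. */
void fftshift_1d(const complex_f *in, complex_f *out, size_t n)
{
    assert(in != out);  /* not an in-place routine */
    for (size_t i = 0; i < n; ++i)
        out[(i + n / 2) % n] = in[i];
}

/* ifftshift: undoes fftshift; the two differ only when n is odd. */
void ifftshift_1d(const complex_f *in, complex_f *out, size_t n)
{
    assert(in != out);  /* not an in-place routine */
    for (size_t i = 0; i < n; ++i)
        out[(i + (n + 1) / 2) % n] = in[i];
}
```

All three dimensions used here (256, 232 and 88) are even, so for this data a shift and its inverse move every sample by the same half-length offset.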

The FFT and IFFT are performed using cuFFT functions. The first step in performing a cuFFT
transform is to create a plan with a cufftPlan function. Once created, a plan can be reused
for subsequent calls. We give the plan the x, y, z dimensions and the type of computation,
for example complex to complex. The FFT and IFFT calls are made by executing the plan with a
cufftExec function. Table 2 shows the time taken for the dataconsistency3d function to
execute in Matlab and on the GPU.
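cuFFT implementation of the MATLAB FFT:
cufftHandle plan_fft;
cufftPlan2d(&plan_fft, y, x, CUFFT_C2C);                 // complex-to-complex plan
cufftExecC2C(plan_fft, img_d, out_d, CUFFT_FORWARD);     // forward transform
cudaDeviceSynchronize();                                 // wait for the transform to finish
cuFFT implementation of the MATLAB IFFT:
cufftHandle plan_ifft;
cufftPlan2d(&plan_ifft, y, x, CUFFT_C2C);                // complex-to-complex plan
cufftExecC2C(plan_ifft, img_d, out_d, CUFFT_INVERSE);    // inverse transform
cudaDeviceSynchronize();                                 // wait for the transform to finish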
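One point to keep in mind when comparing against Matlab: cuFFT transforms are unnormalized, so a forward transform followed by an inverse transform returns the input multiplied by the number of elements, whereas Matlab's ifftn includes the 1/N factor. The report does not say whether or where this scaling was applied, so the host-side helper below is only a sketch of the step; the struct complex_f and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuComplex. */
typedef struct { float x; float y; } complex_f;

/* Divide every element by n, the number of points in the transform,
   so that an unnormalized inverse FFT matches Matlab's ifftn. */
void scale_by_inverse_n(complex_f *data, size_t n)
{
    assert(n > 0);
    const float scale = 1.0f / (float)n;
    for (size_t i = 0; i < n; ++i) {
        data[i].x *= scale;
        data[i].y *= scale;
    }
}
```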
Table 2. Time taken Matlab Vs GPU for various Input Sizes
No. of coils used   Input size (MB)   MATLAB time (t1 in ms)   GPU time (t2 in ms)   Performance gain (t1/t2)
1                   39.875            975                      57                    17.10526316
2                   79.75             1437                     103.927               13.82701319
4                   159.5             2083                     200                   10.415
8                   319               4106                     385.53                10.65027365
16                  638               7761                     777.987               9.975745096
32                  1276              15356                    1455                  10.55395189

Figure 5. Time taken Matlab Vs GPU for various Input Sizes (x-axis: input size in MB, from 39.875 MB to 1276 MB; y-axis: time in ms; series: time taken to execute in MATLAB, t1, and time taken to execute in GPU, t2)

Figure 6. Performance gain for various input sizes (x-axis: input size in MB, from 39.875 MB to 1276 MB; y-axis: speed-up)

In the MATLAB implementation, the time taken by the data consistency 3D function grows with
the input size, reaching about 15.4 s for all 32 coils. In the CUDA implementation, by
contrast, the maximum time is only about 1.5 s for all 32 coils.
The FFT kernel is launched once for every coil. As the number of input coils increases, the
overhead of memory transfers between host and device memory also increases. Currently, only
one stream is used for performing the FFT / IFFT. Using multiple streams to pipeline the
memory copies and kernel execution could increase performance further. We experimented with
multiple streams and software pipelining to improve the throughput, but this caused an
output mismatch, and further research is needed to resolve the issue. All of the code is
attached in the Appendix for reference.

VERIFICATION: dataconsistency3d

Validating Reconstructed Image Using Peak Signal-to-Noise Ratio:


The Mean Square Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR) are the two error
metrics used to compare image quality. The MSE represents the cumulative squared error
between the reconstructed image and the reference image, whereas the PSNR represents a
measure of the peak error.
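The two metrics are not written out in the text. The standard definitions are given here as an assumption, with I the GPU output, I_0 the reference and N the number of elements; they are consistent with the values reported in Table 3.

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left| I_0(n) - I(n) \right|^2,
\qquad
\mathrm{PSNR} = 10 \log_{10} \frac{\left( \max I_0 \right)^2}{\mathrm{MSE}} \ \text{dB}
```

As a check against the first row of Table 3, Coil1 has MSE = 2.76485E-05 and a maximum of 0.8731, which gives 10 log10(0.8731^2 / 2.76485E-05), or about 44.40 dB.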

I0 is a known, perfect answer, typically obtained by running the function in Matlab.
The following table shows the MSE and PSNR for the different coil inputs.
Table 3. MSE and PSNR for various coils
Coils     Mean Square Error   Maximum I0   PSNR (dB)
Coil1     2.76485E-05         0.8731       44.40455724
Coil2     0.000262461         0.8297       34.18777834
Coil3     2.52628E-05         0.8984       45.0445775
Coil4     2.45329E-05         0.8798       44.99019378
Coil5     4.57182E-07         0.9082       62.56273853
Coil6     6.1467E-06          0.9304       51.48697426
Coil7     1.42437E-05         0.9319       47.85117212
Coil8     1.46217E-06         0.9719       58.10246753
Coil9     9.14289E-05         0.97         40.12459774
Coil10    0.00010088          0.9363       39.39023257
Coil11    6.15868E-06         0.9399       51.56675881
Coil12    0.000287483         0.9498       34.96651801
Coil13    0.000110432         0.978        39.37581573
Coil14    4.6884E-05          0.9709       43.03324413
Coil15    6.61326E-05         0.9484       41.33567204
Coil16    0.000187387         0.8771       36.13359831
Coil17    8.53287E-06         0.9545       50.28456843
Coil18    9.77867E-06         0.8778       48.9651119
Coil19    1.49731E-06         0.9177       57.50088891
Coil20    1.22615E-05         0.8416       47.6166772
Coil21    3.36993E-05         0.9124       43.92750306
Coil22    2.44395E-05         0.8917       45.12345043
Coil23    7.40974E-05         0.8703       40.09534734
Coil24    4.89955E-05         0.8472       41.65815335
Coil25    0.000186808         0.8877       36.25137087
Coil26    4.91792E-05         0.9376       42.52253555
Coil27    1.26168E-05         0.9754       48.77417303
Coil28    5.07543E-05         0.9687       42.66905996
Coil29    7.33233E-05         0.9274       40.69292106
Coil30    2.85137E-06         0.9467       54.97370418
Coil31    1.08874E-05         0.9251       48.95454347
Coil32    9.10939E-06         0.9596       50.04691095

Figure 7. PSNR of all Coils (x-axis: coils C1 to C32; y-axis: PSNR in decibels)
As can be seen from the graph above, an average PSNR of about 45 dB is obtained over the
output of all coils.

CONCLUSION
The bottleneck functions in the MATLAB code for MRI reconstruction were identified. Of the
two bottleneck functions, the RecPF denoise function did not exhibit any kind of parallelism
and works better when implemented sequentially. The parallel implementation of the other
function, data consistency 3D, however, showed considerable speedup (10.55x) over its
sequential counterpart. Further improvements can be made by performing the cuFFT FFT / IFFT
of multiple coils simultaneously over parallel streams. The FFT and IFFT shifts were
implemented natively on the CPU as well as in CUDA on the GPU. The parallel implementation
of the FFT shift was found to be 24x faster, and that of the IFFT shift 17x faster. Overall,
we experimented with multiple ways to improve the existing MATLAB code and were able to
accelerate the computation.

REFERENCE
1. https://developer.nvidia.com/cufft
2. Wen-mei W. Hwu, GPU Computing Gems (Emerald Edition), ISBN: 978-0-12384988-5.
3. http://developer.nvidia.com/object/matlab_cuda.html
4. http://mri-q.com/index.html
5. http://devblogs.nvidia.com/parallelforall/finite-difference-methods-cuda-cc-part-1/
6. University of Waterloo (2007). http://www.science.uwaterloo.ca/hmerz/CUDA_benchFFT/

APPENDIX A. CUDA kernel code for FFT shift


__global__ void cufftShift_2D_kernel(cuComplex* input, cuComplex* output,
                                     int Nx, int Ny)
{
    // Row length and slice size: index = yIndex * Nx + xIndex, so one row is Nx elements
    int sLine = Nx;
    int sSlice = Nx * Ny;
    // Offsets that move an element by half a slice plus or minus half a row
    int sEq1 = (sSlice + sLine) / 2;
    int sEq2 = (sSlice - sLine) / 2;
    // Thread index (2D)
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    // Threads that fall outside the image have nothing to do
    if (xIndex >= Nx || yIndex >= Ny)
        return;
    // Thread index converted into 1D index
    int index = (yIndex * Nx) + xIndex;
    // Each thread reads one element and writes one element,
    // so no __syncthreads() barrier is required
    if (xIndex < Nx / 2)
    {
        if (yIndex < Ny / 2)
        {
            // First Quad
            output[index] = input[index + sEq1];
        }
        else
        {
            // Third Quad
            output[index] = input[index - sEq2];
        }
    }
    else
    {
        if (yIndex < Ny / 2)
        {
            // Second Quad
            output[index] = input[index + sEq2];
        }
        else
        {
            // Fourth Quad
            output[index] = input[index - sEq1];
        }
    }
}

APPENDIX B. CUDA kernel code for IFFT shift


__global__ void cuifftShift_2D_kernel(cuComplex* input, cuComplex* output,
                                      int Nx, int Ny)
{
    // Thread index (2D)
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    // Threads that fall outside the image have nothing to do
    if (xIndex >= Nx || yIndex >= Ny)
        return;
    // Thread index converted into 1D index
    int index = (yIndex * Nx) + xIndex;
    // Shift by half a row along x only; each thread handles one element,
    // so no __syncthreads() barrier is required
    if (xIndex < Nx / 2)
    {
        // First Half
        output[index] = input[index + Nx/2];
    }
    else
    {
        // Second Half
        output[index] = input[index - Nx/2];
    }
}

APPENDIX C.CUDA code for FFT/IFFT implementation using multiple streams


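cudaStream_t streams[4];
for (int s = 0; s < 4; s++)
    cudaStreamCreate(&streams[s]);   // streams must be created before they are used
// Each pass handles four coils (i .. i+3), so the loop advances by 4
for(int i=0; i<8 ; i+=4) {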
dim3 dimblock(16,16,1);
dim3 dimgrid(256/16,232*88/16,1);
cudaMemcpy(img_d, img_h+i*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_0,img_d,x,y);
cudaDeviceSynchronize();
cudaMemcpy(img_d, img_h+(i+1)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_1,img_d,x,y);
cudaDeviceSynchronize();
cudaMemcpy(img_d, img_h+(i+2)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_2,img_d,x,y);
cudaDeviceSynchronize();
cudaMemcpy(img_d, img_h+(i+3)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_3,img_d,x,y);
cudaDeviceSynchronize();
cufftHandle* plans = (cufftHandle*) malloc(sizeof(cufftHandle)*4);   // one plan per stream
for (int s = 0; s < 4; s++) {
cufftPlan2d(&plans[s], y,x, CUFFT_C2C );
cufftSetStream(plans[s], streams[s]);
}
cufftExecC2C(plans[0], in_d_0, out_d_0, CUFFT_FORWARD);
cufftExecC2C(plans[1], in_d_1, out_d_1, CUFFT_FORWARD);
cufftExecC2C(plans[2], in_d_2, out_d_2, CUFFT_FORWARD);
cufftExecC2C(plans[3], in_d_3, out_d_3, CUFFT_FORWARD);
for(int s = 0; s < 4; s++)
cudaStreamSynchronize(streams[s]);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_00,out_d_0,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+i*(inputSize), out_d_00, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_11,out_d_1,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+(i+1)*inputSize, out_d_11, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_22,out_d_2,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+(i+2)*inputSize, out_d_22, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_33,out_d_3,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+ (i+3)*inputSize, out_d_33, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);

for(int l=0 ;l<picks_size*4; l++)
{
out_h[picks_h[l]].x = sig_h[picks_h[l]].x;
out_h[picks_h[l]].y = sig_h[picks_h[l]].y;
}

cudaMemcpy(out_d_00, out_h+i*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cudaMemcpy(out_d_11, out_h + (i+1)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_0,out_d_00,x,y);
cudaDeviceSynchronize();
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_1,out_d_11,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_d_22, out_h+ (i+2)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cudaMemcpy(out_d_33, out_h + (i+3)*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_2,out_d_22,x,y);
cudaDeviceSynchronize();
cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_3,out_d_33,x,y);
cudaDeviceSynchronize();
cufftExecC2C(plans[0], in_d_0, out_d_0, CUFFT_INVERSE);
cufftExecC2C(plans[1], in_d_1, out_d_1, CUFFT_INVERSE);
cufftExecC2C(plans[2], in_d_2, out_d_2, CUFFT_INVERSE);
cufftExecC2C(plans[3], in_d_3, out_d_3, CUFFT_INVERSE);
for(int s = 0; s < 4; s++)
cudaStreamSynchronize(streams[s]);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_0,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+i*inputSize, out_d, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_1,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+(i+1)*inputSize, out_d, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_2,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+(i+2)*inputSize, out_d, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_3,x,y);
cudaDeviceSynchronize();
cudaMemcpy(out_h+(i+3)*inputSize, out_d, inputSize*sizeof(cuComplex),
cudaMemcpyDeviceToHost);
}