
Abdallah Abu Ghazaleh

Gokul Subramani

Gurudutt Ramaprasad

Karthik Manivannan

Sridhar Panneer selvam


INTRODUCTION

Modern GPUs are an increasingly attractive platform for accelerating applications ranging from black hole simulations to DNA sequencing, and the number of applications accelerated with CUDA grows every day.

In order to learn and experience the effectiveness of CUDA, we took up the challenge of speeding up the computations involved in advanced Magnetic Resonance Imaging (MRI) reconstruction.

In this paper, we present an approach to accelerating an MRI image reconstruction algorithm using CUDA C on an NVIDIA GTX 480 Graphics Processing Unit (GPU). The raw k-space data captured by the scanner coils must be processed before a doctor can interpret the results of an MRI scan, and this processing takes a few hours for the best algorithm implemented on a CPU. We observed that the algorithm contains portions that exhibit high data parallelism and can be computed much faster on a GPU. We identified the bottlenecks of the reconstruction algorithm and implemented those functions in CUDA C. This paper focuses on the approach and techniques used to convert the Matlab version of the algorithm into a highly parallel CUDA C version.

MOTIVATION

The motivation for this project stems from the fact that an MRI scan can be a long and uncomfortable experience: patients must lie motionless in the machine for close to an hour. This project, under the guidance of Prof. John Sartori and Prof. Mehmet Akcakaya, aims to reduce the time required for the scan and the associated computations by utilizing the GPU. Dr. Mehmet Akcakaya is currently exploring an algorithm called LOST (LOw dimensional-structure Self-learning and Thresholding), which adaptively finds a sparse representation of the given image using the features in the image itself, rather than a pre-determined fixed transform domain.

The algorithm consists of two phases: one phase de-noises the image for accuracy, and the other performs the Fourier transform for reconstruction.

We proposed that there is large scope for parallelism in the FFT and IFFT computations, which operate on 4D input data as large as 256x232x88x32. We planned to use cuFFT, the built-in CUDA library specialised for FFTs, to optimise the execution of the transforms in parallel.

DESIGN OVERVIEW:

The Discrete Fourier Transform maps a complex-valued vector x of length N into its frequency-domain representation X, given by

X_k = sum_{n=0}^{N-1} x_n * e^(-2*pi*i*k*n/N),  k = 0, 1, ..., N-1.

The cuFFT library uses an algorithm with O(N log N) complexity for an input of size N. Of the various forms of FFT, we used the C2C (complex-input to complex-output) transform. A GTX 480 was used for running the code. The device capabilities are:

Microarchitecture: Fermi
Maximum number of threads per block: 1024
Maximum amount of shared memory: 48 kB
Constant memory size: 64 kB
Global memory: 1.46 GB DRAM
Pitched memory: 2.0 GB

Cufft vs fftw: A study from the University of Waterloo found that cuFFT is better suited to larger FFTs: cuFFT starts to outperform FFTW around data sizes of 8192 elements. It is therefore safe to assume that cuFFT works better than FFTW for inputs larger than roughly 10,000 elements.

PROFILE SUMMARY of Matlab Code:


Profiling showed that, of all the functions in the Matlab code provided to us, just three (recons_cs, data_consistency_3d, and RecPF_denoise) consumed 72% (548 s of 752 s) of the run time. It is therefore sensible to convert only those functions into CUDA C.

FLOW CHART:

Step 1: Input the image file (256x232x88x32) into the GPU's global memory.

Step 2: Find the centre dimensions to determine the critical image area and filter the image using a Tukey window filter.

Step 3: Perform the inverse fast Fourier transform for all coils and sum them to a 3D volume.

Step 4

Step 5

IMPLEMENTATION: The basics

We read our inputs from an ASCII file and store the data in cuComplex structs. cuComplex is a basic struct with two member variables, x and y, representing the real and imaginary parts of the number respectively. Any computation on cuComplex numbers must be carried out through the functions in the cuComplex.h library (the operators are not overloaded to perform addition, multiplication, etc.).
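The pattern looks like the following host-side C sketch. The struct here is a stand-in with the same x/y layout as cuComplex; in device code the analogous helpers are cuCaddf and cuCmulf from cuComplex.h:

```c
/* Host-side stand-in for CUDA's cuComplex: two floats, x = real, y = imag. */
typedef struct { float x, y; } complex_f;

/* Addition and multiplication must go through helper functions,
 * mirroring cuCaddf / cuCmulf, since C has no operator overloading. */
complex_f cadd(complex_f a, complex_f b)
{
    complex_f r = { a.x + b.x, a.y + b.y };
    return r;
}

complex_f cmul(complex_f a, complex_f b)
{
    /* (a.x + i*a.y) * (b.x + i*b.y) */
    complex_f r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}
```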

We also use the cuFFT library, which is part of the CUDA toolkit. cuFFT takes cuComplex as its input data structure for any kind of FFT computation and provides methods for both forward and inverse Fourier transforms. We had to implement the FFT shifts and inverse FFT shifts ourselves.

The input is squeezed along the X dimension to form 2D image data, in which the most critical data lie at the centre of the image. The critical data must be selected from the input image via Y-start, Y-end, Z-start and Z-end. Initially, the start and end points in both dimensions point to the centre-most pixel of the image; we then determine a rectangle around this point that contains all valid data (Figure 4, centre dimension).

This rectangle now contains the key image data, which needs to be filtered and enhanced the most compared with the other pixels in the image. To preserve the centre data and reduce the noise of the surrounding elements, we filter the image with a Tukey window filter, then sum the filtered image over all 32 coils to get a 3D input image of size 256x232x88.

The filtered 3D image is sent to a de-noise function called RecPF_denoise and an image estimate is found. The estimated 3D image is then compared with the 4D original image for all 32 coils; the essential (i.e. critical) part of the image is preserved while the noise is cancelled out. We repeat this process 25 times to eliminate the noise from the image, as shown in Figure 1.

As GPU engineers we are interested in the parallel part of the algorithm, so we take the original image and the filtered image as our input. As Figure 2 shows, the functions RecPF_denoise and data_consistency_3d take much longer than the other functions, so we are interested in implementing these two on the GPU.

The second part of the code that could exploit parallelism was the RecPF_denoise function. A total of about 175 seconds (roughly 3 minutes) is spent processing data in that function. It is called 24 times (one less than the 3D data-consistency part), so each call takes about 7.3 seconds. Our goal was to beat that number.

The RecPF_denoise Matlab code contains many computations and library calls to both the image-processing and signal-processing toolkits. There are numerous FFT (Fourier transform) computations and numerous finite-difference computations, along with manipulation of the input data. The input data is the single image produced after each cycle of the main reconstruction algorithm (mainly the 3D data-consistency part).

RecPF_denoise first initializes constants for its finite-difference method in constant memory, which is compiled to the device; it also places its optimization parameters in constant memory. Values in constant memory are retained across kernel calls. RecPF_denoise then calculates the initial parameters for the calculation loop, a numerator and a denominator, both of which require an FFT.

Following those initializations, the main loop begins, divided into a few parts. We would have preferred a single kernel for the entire processing of the data, but because we need to call cuFFT on our data throughout, we must divide the calculations into parts and save the current state in global memory so that we can resume from the cuFFT outputs. The loop runs three times and solves the three big parts of the MATLAB code: the W sub-problem, the U sub-problem, and a final Bregman update of values for the next run. We wrote a brute-force version of the code first and then started optimizing.

For some of the computations, we reordered operations to minimize the number of variables used in the brute-force version (and in the Matlab code). By removing the extra variables we reduced memory consumption and increased performance. Additionally, parts of the Matlab code were inefficient, for example data that was calculated but never used; we removed those as well.

For the Fourier-transform code, the implementation was straightforward. For the finite-difference methods, we used code published on the NVIDIA Parallel Forall blog [5] and modified it to run with our input size and with cuComplex data. The code uses shared memory, and we had to tune the shared-memory size because, with our input size, the compiler initially reported that we were allocating more shared memory than the device provides.
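The core of such a finite-difference kernel is a stencil. A minimal 1D second-order central-difference sketch in host C (the blog's CUDA version tiles this same computation into shared memory, one thread per point):

```c
/* First derivative by 2nd-order central differences on a uniform grid
 * of spacing h, interior points only: out[i] = (f[i+1] - f[i-1]) / (2h).
 * The GPU version assigns one thread per i and stages f in shared memory. */
void central_diff(const double *f, double *out, int n, double h)
{
    for (int i = 1; i < n - 1; i++)
        out[i] = (f[i + 1] - f[i - 1]) / (2.0 * h);
    out[0] = out[n - 1] = 0.0;  /* boundaries not treated in this sketch */
}
```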

Combining the code proved challenging (we started running into problems with the finite differences when we tried to put them all into one kernel call), so we abandoned that idea, especially since there were verification problems with the RecPF_denoise part.

For the RecPF_denoise part, we analysed our output against the expected results (produced by the Matlab code) and observed that it was entirely NaN (a problem we had also encountered when first running the Matlab code with wrong configuration parameters). This led to an investigation of which parts were at fault. Starting from the beginning, we noticed that our numerator and denominator do not match Matlab's. At the moment the method returns nan + j*nan for every output; for comparison, the first two expected elements are -2.483373e-10 - j1.389864e-09 and 1.088936e-10 + j6.383455e-10. Further work should investigate what causes this; we ran out of time to triage it.

PERFORMANCE

Initially our CUDA implementation of RecPF_denoise took 40 seconds over 10 iterations. We later noticed the average was computed incorrectly (we weren't dividing by 10), which gives about 4 seconds per loop iteration. This is below the 7-second Matlab baseline, so we were getting some improvement, though not much yet. After further analysis, we discovered that nvcc was building in debug mode; after creating a new Makefile and re-running the kernel, we achieved a large performance increase, with run times now under half a second. That is about a 14x improvement (not 100x yet, but substantial).

We tried changing the block size of the finite-difference part, but saw little performance change (we were already using a long stencil in shared memory, which was already optimized).

The memory cost of RecPF_denoise is large: besides the input and output data, we allocate 10 other matrices to store our data, for a total of 256*232*88*8*(10+2) = 501,743,616 bytes, about 0.5 GB. It is hard to say whether we are memory-bound or compute-bound in our case.
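The figure above follows directly from the element count, with 8 bytes per cuComplex element (two floats) and 12 matrices counting input and output; a quick check:

```c
#include <stdint.h>

/* Bytes needed for `mats` complex matrices of size nx * ny * nz,
 * with `elem` bytes per element (8 for cuComplex: two floats). */
uint64_t mri_mem_bytes(uint64_t nx, uint64_t ny, uint64_t nz,
                       uint64_t elem, uint64_t mats)
{
    return nx * ny * nz * elem * mats;
}
```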

IMPLEMENTATION: dataconsistency3d

The source Matlab code for the dataconsistency3d function is shown below:

sig_check = fftshift(fftn(ifftshift(img, 1)));

sig_check(picks) = sig(picks);

z = fftshift(ifftn(ifftshift(sig_check)), 1);

The main goal of this function is to retain the critical part of the image. During the RecPF_denoise process all the image data are de-noised; picks holds the indices of all critical parts of the data, and sig is the original 4D data. The function performs an FFT, replaces the critical part, performs an IFFT, and sends the result back to the RecPF_denoise process. This repeats 25 times. Since many 3D FFTs and IFFTs are performed, a CUDA implementation will speed up the process.

The first step is converting Matlab functions such as fftshift and ifftshift into C code; we implemented kernels for both. The following table compares the time taken by the fftshift and ifftshift functions on the CPU and the GPU, with the same input size as the 3D matrix, 256x232x88.
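For an array with even dimensions, fftshift amounts to a quadrant swap; a host-side C reference for the 2D case (the CUDA kernel in the Appendix computes the same index mapping with one thread per element):

```c
/* fftshift for a ny x nx 2D array (even dimensions) stored row-major:
 * swaps quadrants so the zero-frequency element moves to the centre.
 * out and in must not alias. */
void fftshift2d(const float *in, float *out, int nx, int ny)
{
    for (int y = 0; y < ny; y++) {
        for (int x = 0; x < nx; x++) {
            int sx = (x + nx / 2) % nx;   /* shifted column */
            int sy = (y + ny / 2) % ny;   /* shifted row */
            out[sy * nx + sx] = in[y * nx + x];
        }
    }
}
```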

Table 1. FFT and IFFT shift functions, CPU vs GPU

Function     Time to execute on CPU   Time to execute on GPU   Performance gain
fftshift     12.5 ms                  0.51 ms                  24.5x
ifftshift    12.945 ms                0.753 ms                 17.19x

The FFT and IFFT are performed using cuFFT functions. The first step is to create a plan using a cufftPlan function; once created, a plan can be reused for subsequent calls. We give the plan the x, y, z dimensions and the type of computation (say, complex-to-complex). The FFT and IFFT calls are then made by executing the plan with a cufftExec function. The table below shows the time taken by the dataconsistency3d function in Matlab and on the GPU.

cuFFT calls for the forward FFT:

cufftHandle plan_fft;
cufftPlan2d(&plan_fft, y, x, CUFFT_C2C);
cufftExecC2C(plan_fft, img_d, out_d, CUFFT_FORWARD);
cudaDeviceSynchronize();

cuFFT calls for the inverse FFT:

cufftHandle plan_ifft;
cufftPlan2d(&plan_ifft, y, x, CUFFT_C2C);
cufftExecC2C(plan_ifft, img_d, out_d, CUFFT_INVERSE);
cudaDeviceSynchronize();


Table 2. dataconsistency3d function, MATLAB vs GPU

No. of coils   Input size (MB)   Time in MATLAB (t1, ms)   Time on GPU (t2, ms)   Gain (t1/t2)
1              39.875            975                       57                     17.11
2              79.75             1437                      103.927                13.83
4              159.5             2083                      200                    10.42
8              319               4106                      385.53                 10.65
16             638               7761                      777.987                9.98
32             1276              15356                     1455                   10.55

[Chart: time to execute in MATLAB (t1, in ms) vs input size, 39.875 MB to 1276 MB]

[Chart: speedup (x) vs input size, 39.875 MB to 1276 MB]

Figure 6. Performance gain for various input sizes

In the MATLAB implementation, the time taken to compute the data consistency 3D function increases drastically with input size, reaching a maximum of about 16 s for all 32 coils, whereas in the CUDA implementation the maximum time is only 1.5 s for all 32 coils.

The FFT kernel is launched once per coil. As the number of coils increases, the overhead of memory transfers between host and device memory also increases. Currently only one stream is used for the FFT/IFFT; using multiple streams to pipeline the memcpy and kernel-execution tasks could increase performance further. We experimented with multiple streams and software pipelining to improve throughput, but that caused output mismatches, and further research is needed to resolve the issue. The full implementation is attached in the Appendix for reference.

VERIFICATION: dataconsistency3d


The Mean Square Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR) are the two error metrics used to compare image quality. The MSE is the cumulative squared error between the reconstructed image and the original, while the PSNR measures the peak error. I0 is a known, correct answer, typically obtained by running the function in Matlab. The following table shows the MSE and PSNR for each coil.

Table 3. MSE and PSNR for various coils

Coil     Mean Square Error   Maximum I0   PSNR (dB)
Coil1    2.76485E-05         0.8731       44.40
Coil2    0.000262461         0.8297       34.19
Coil3    2.52628E-05         0.8984       45.04
Coil4    2.45329E-05         0.8798       44.99
Coil5    4.57182E-07         0.9082       62.56
Coil6    6.1467E-06          0.9304       51.49
Coil7    1.42437E-05         0.9319       47.85
Coil8    1.46217E-06         0.9719       58.10
Coil9    9.14289E-05         0.97         40.12
Coil10   0.00010088          0.9363       39.39
Coil11   6.15868E-06         0.9399       51.57
Coil12   0.000287483         0.9498       34.97
Coil13   0.000110432         0.978        39.38
Coil14   4.6884E-05          0.9709       43.03
Coil15   6.61326E-05         0.9484       41.34
Coil16   0.000187387         0.8771       36.13
Coil17   8.53287E-06         0.9545       50.28
Coil18   9.77867E-06         0.8778       48.97
Coil19   1.49731E-06         0.9177       57.50
Coil20   1.22615E-05         0.8416       47.62
Coil21   3.36993E-05         0.9124       43.93
Coil22   2.44395E-05         0.8917       45.12
Coil23   7.40974E-05         0.8703       40.10
Coil24   4.89955E-05         0.8472       41.66
Coil25   0.000186808         0.8877       36.25
Coil26   4.91792E-05         0.9376       42.52
Coil27   1.26168E-05         0.9754       48.77
Coil28   5.07543E-05         0.9687       42.67
Coil29   7.33233E-05         0.9274       40.69
Coil30   2.85137E-06         0.9467       54.97
Coil31   1.08874E-05         0.9251       48.95
Coil32   9.10939E-06         0.9596       50.05

[Chart: PSNR in decibels for coils C1 to C32]

As can be seen from the graph above, an average PSNR of 45 dB is obtained from the output

of all coils.


CONCLUSION

The bottleneck functions in the MATLAB code for MRI reconstruction were identified. Of the two bottleneck functions, RecPF_denoise did not exhibit any usable parallelism and works better when implemented sequentially. The parallel implementation of the other function, data_consistency_3d, however, showed considerable speedup (10.55x) over its sequential counterpart. Further improvements can be made by performing the CUDA FFT/IFFT of multiple coils simultaneously over parallel streams. The FFT and IFFT shifts were implemented both natively on the CPU and in CUDA on the GPU; the parallel FFT shift is 24x faster, and the IFFT shift 17x faster. Overall, we experimented with multiple ways to improve the existing MATLAB code and were able to successfully accelerate the computation.

REFERENCES

1. NVIDIA cuFFT library. https://developer.nvidia.com/cufft
2. Wen-mei W. Hwu (ed.), GPU Computing Gems (Emerald Edition). ISBN 978-0-12-384988-5.
3. http://developer.nvidia.com/object/matlab_cuda.html
4. http://mri-q.com/index.html
5. Finite Difference Methods in CUDA C/C++, Part 1. http://devblogs.nvidia.com/parallelforall/finite-difference-methods-cuda-cc-part-1/
6. University of Waterloo CUDA FFT benchmarks (2007). http://www.science.uwaterloo.ca/hmerz/CUDA_benchFFT/

APPENDIX

__global__ void cufftShift_2D_kernel(cuComplex* input, cuComplex* output, int Nx, int Ny)


{
    // 2D slice and 1D line sizes
    int sLine = Ny;
    int sSlice = Nx * Ny;

    // Transformation equations
    int sEq1 = (sSlice + sLine) / 2;
    int sEq2 = (sSlice - sLine) / 2;

    // Thread index (2D)
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread index converted into a 1D index
    int index = (yIndex * Nx) + xIndex;

    // Note: __syncthreads() calls inside the divergent branches of the
    // original were removed; synchronization in divergent control flow is
    // undefined behaviour, and no thread here reads another thread's data.
    if (xIndex < Nx / 2)
    {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq1];   // first quadrant
        else
            output[index] = input[index - sEq2];   // third quadrant
    }
    else
    {
        if (yIndex < Ny / 2)
            output[index] = input[index + sEq2];   // second quadrant
        else
            output[index] = input[index - sEq1];   // fourth quadrant
    }
}

__global__ void cuifftShift_2D_kernel(cuComplex* input, cuComplex* output, int Nx, int Ny)
{
    // Thread index (2D)
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread index converted into a 1D index
    int index = (yIndex * Nx) + xIndex;

    // Shift along the X dimension only
    if (xIndex < Nx / 2)
        output[index] = input[index + Nx / 2];   // first half
    else
        output[index] = input[index - Nx / 2];   // second half
}

cudaStream_t streams[4];
for (int s = 0; s < 4; s++)
    cudaStreamCreate(&streams[s]);

for (int i = 0; i < 32; i += 4) {  // four coils per batch

dim3 dimblock(16,16,1);

dim3 dimgrid(256/16,232*88/16,1);

cudaMemcpy(img_d, img_h+i*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_0,img_d,x,y);

cudaDeviceSynchronize();

cudaMemcpy(img_d, img_h+(i+1)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_1,img_d,x,y);

cudaDeviceSynchronize();

cudaMemcpy(img_d, img_h+(i+2)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_2,img_d,x,y);

cudaDeviceSynchronize();

cudaMemcpy(img_d, img_h+(i+3)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_3,img_d,x,y);

cudaDeviceSynchronize();

cufftHandle* plans = (cufftHandle*) malloc(sizeof(cufftHandle)*4);
for (int j = 0; j < 4; j++) {
    cufftPlan2d(&plans[j], y, x, CUFFT_C2C);
    cufftSetStream(plans[j], streams[j]);
}

cufftExecC2C(plans[0], in_d_0, out_d_0, CUFFT_FORWARD);
cufftExecC2C(plans[1], in_d_1, out_d_1, CUFFT_FORWARD);
cufftExecC2C(plans[2], in_d_2, out_d_2, CUFFT_FORWARD);
cufftExecC2C(plans[3], in_d_3, out_d_3, CUFFT_FORWARD);

for (int j = 0; j < 4; j++)
    cudaStreamSynchronize(streams[j]);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_00,out_d_0,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+i*(inputSize), out_d_00, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_11,out_d_1,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+1)*inputSize, out_d_11, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_22,out_d_2,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+2)*inputSize, out_d_22, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d_33,out_d_3,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+3)*inputSize, out_d_33, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

for (int l = 0; l < numPicks; l++)  /* numPicks: count of picked indices (name assumed; header lost in extraction) */
{
    out_h[picks_h[l]].x = sig_h[picks_h[l]].x;
    out_h[picks_h[l]].y = sig_h[picks_h[l]].y;
}

cudaMemcpy(out_d_00, out_h + i*inputSize, inputSize*sizeof(cuComplex),
cudaMemcpyHostToDevice);

cudaMemcpy(out_d_11, out_h + (i+1)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_0,out_d_00,x,y);

cudaDeviceSynchronize();

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_1,out_d_11,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_d_22, out_h+ (i+2)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cudaMemcpy(out_d_33, out_h + (i+3)*inputSize, inputSize*sizeof(cuComplex),

cudaMemcpyHostToDevice);

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_2,out_d_22,x,y);

cudaDeviceSynchronize();

cufftShift_2D_kernel<<<dimgrid,dimblock>>>(in_d_3,out_d_33,x,y);

cudaDeviceSynchronize();

cufftExecC2C(plans[0], in_d_0, out_d_0, CUFFT_INVERSE);

cufftExecC2C(plans[1], in_d_1, out_d_1, CUFFT_INVERSE);

cufftExecC2C(plans[2], in_d_2, out_d_2, CUFFT_INVERSE);
cufftExecC2C(plans[3], in_d_3, out_d_3, CUFFT_INVERSE);

for(int i = 0; i < 4; i++)

cudaStreamSynchronize(streams[i]);


cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_0,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+i*inputSize, out_d, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_1,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+1)*inputSize, out_d, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_2,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+2)*inputSize, out_d, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

cuifftShift_2D_kernel<<<dimgrid,dimblock>>>(out_d,out_d_3,x,y);

cudaDeviceSynchronize();

cudaMemcpy(out_h+(i+3)*inputSize, out_d, inputSize*sizeof(cuComplex),

cudaMemcpyDeviceToHost);

}
