CUDA

CUFFT Library

PG-05327-032_V01 August, 2010


Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS". NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright © 2005-2010 by NVIDIA Corporation. All rights reserved.

NVIDIA Corporation

Table of Contents

CUFFT Library
    CUFFT Types and Definitions
        Type cufftHandle
        Type cufftResult
        Type cufftReal
        Type cufftDoubleReal
        Type cufftComplex
        Type cufftDoubleComplex
        Type cufftCompatibility
        CUFFT Transform Types
        CUFFT Transform Directions
        Streamed CUFFT Transforms
        FFTW Compatibility Mode
    CUFFT API Functions
        Function cufftPlan1d()
        Function cufftPlan2d()
        Function cufftPlan3d()
        Function cufftPlanMany()
        Function cufftDestroy()
        Function cufftExecC2C()
        Function cufftExecR2C()
        Function cufftExecC2R()
        Function cufftExecZ2Z()
        Function cufftExecD2Z()
        Function cufftExecZ2D()
        Function cufftSetStream()
        Function cufftSetCompatibilityMode()
    Accuracy and Performance
    CUFFT Code Examples
        1D Complex-to-Complex Transforms
        1D Real-to-Complex Transforms
        2D Complex-to-Complex Transforms
        Batched 2D Complex-to-Complex Transforms
        2D Complex-to-Real Transforms
        3D Complex-to-Complex Transforms

CUFFT Library

This document describes CUFFT, the NVIDIA® CUDA™ Fast Fourier Transform (FFT) library. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and it is one of the most important and widely used numerical algorithms, with applications that include computational physics and general signal processing. The CUFFT library provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation.

FFT libraries typically vary in terms of supported transform sizes and data types. For example, some libraries only implement Radix-2 FFTs, restricting the transform size to a power of two, while other implementations support arbitrary transform sizes. This version of the CUFFT library supports the following features:

- 1D, 2D, and 3D transforms of complex and real-valued data
- Batch execution for doing multiple transforms of any dimension in parallel
- 2D and 3D transform sizes in the range [2, 16384] in any dimension
- 1D transform sizes up to 8 million elements
- In-place and out-of-place transforms for real and complex data
- Double-precision transforms on compatible hardware (GT200 and later GPUs)
- Support for streamed execution, enabling simultaneous computation together with data movement

CUFFT Types and Definitions

The next sections describe the CUFFT types and transform directions:

- "Type cufftHandle"
- "Type cufftResult"
- "Type cufftReal"
- "Type cufftDoubleReal"
- "Type cufftComplex"
- "Type cufftDoubleComplex"
- "Type cufftCompatibility"
- "CUFFT Transform Types"
- "CUFFT Transform Directions"

Type cufftHandle

    typedef unsigned int cufftHandle;

A handle type used to store and access CUFFT plans (see "CUFFT API Functions" for more information about plans). For example, the user receives a handle after creating a CUFFT plan and uses this handle to execute the plan.

Type cufftResult

    typedef enum cufftResult_t cufftResult;

An enumeration of values used exclusively as API function return values. The possible return values are defined as follows:

Return Values
  CUFFT_SUCCESS          Any CUFFT operation is successful.
  CUFFT_INVALID_PLAN     CUFFT is passed an invalid plan handle.
  CUFFT_ALLOC_FAILED     CUFFT failed to allocate GPU memory.
  CUFFT_INVALID_TYPE     The user requests an unsupported type.
  CUFFT_INVALID_VALUE    The user specifies a bad memory pointer.
  CUFFT_INTERNAL_ERROR   Used for all internal driver errors.
  CUFFT_EXEC_FAILED      CUFFT failed to execute an FFT on the GPU.
  CUFFT_SETUP_FAILED     The CUFFT library failed to initialize.
  CUFFT_INVALID_SIZE     The user specifies an unsupported FFT size.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Type cufftReal

    typedef float cufftReal;

A single-precision, floating-point real data type.

Type cufftDoubleReal

    typedef double cufftDoubleReal;

A double-precision, floating-point real data type.

Type cufftComplex

    typedef cuComplex cufftComplex;

A single-precision, floating-point complex data type that consists of interleaved real and imaginary components.
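Because every API function reports its status through cufftResult, callers often want a readable message for a failed call. The following is a minimal sketch of such a lookup, not part of CUFFT itself: the helper name cufft_result_string() is hypothetical, and the enumeration is a stand-in mirror of the codes in the table above so that the sketch builds without the CUDA toolkit. The numeric order 0 through 9 is an assumption based on the cufft.h header; real code should include cufft.h instead of redefining the type.

```c
/* Stand-in mirror of the cufftResult codes listed in the table above.
 * Assumption: the values run 0..9 in this order, as in cufft.h.
 * Real code should include <cufft.h> and delete this typedef. */
typedef enum {
    CUFFT_SUCCESS        = 0,
    CUFFT_INVALID_PLAN   = 1,
    CUFFT_ALLOC_FAILED   = 2,
    CUFFT_INVALID_TYPE   = 3,
    CUFFT_INVALID_VALUE  = 4,
    CUFFT_INTERNAL_ERROR = 5,
    CUFFT_EXEC_FAILED    = 6,
    CUFFT_SETUP_FAILED   = 7,
    CUFFT_INVALID_SIZE   = 8,
    CUFFT_UNALIGNED_DATA = 9
} cufftResult;

/* Hypothetical helper (not a CUFFT function): returns a short description
 * of a result code, following the wording of the Return Values table. */
static const char *cufft_result_string(cufftResult result)
{
    switch (result) {
    case CUFFT_SUCCESS:        return "operation successful";
    case CUFFT_INVALID_PLAN:   return "invalid plan handle";
    case CUFFT_ALLOC_FAILED:   return "failed to allocate GPU memory";
    case CUFFT_INVALID_TYPE:   return "unsupported type requested";
    case CUFFT_INVALID_VALUE:  return "bad memory pointer";
    case CUFFT_INTERNAL_ERROR: return "internal driver error";
    case CUFFT_EXEC_FAILED:    return "failed to execute an FFT on the GPU";
    case CUFFT_SETUP_FAILED:   return "library failed to initialize";
    case CUFFT_INVALID_SIZE:   return "unsupported FFT size";
    case CUFFT_UNALIGNED_DATA: return "data does not satisfy texture alignment";
    default:                   return "unknown cufftResult value";
    }
}
```

A typical use would be to compare the value returned by an API call against CUFFT_SUCCESS and print cufft_result_string() of it when the two differ.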

Type cufftDoubleComplex

    typedef cuDoubleComplex cufftDoubleComplex;

A double-precision, floating-point complex data type that consists of interleaved real and imaginary components.

Type cufftCompatibility

    typedef enum cufftCompatibility_t cufftCompatibility;

An enumeration of values used to control FFTW data compatibility. See "FFTW Compatibility Mode" for details.

CUFFT Transform Types

The CUFFT library supports complex- and real-data transforms. The cufftType data type is an enumeration of the types of transform data supported by CUFFT:

    typedef enum cufftType_t {
        CUFFT_R2C = 0x2a,  // Real to complex (interleaved)
        CUFFT_C2R = 0x2c,  // Complex (interleaved) to real
        CUFFT_C2C = 0x29,  // Complex to complex, interleaved
        CUFFT_D2Z = 0x6a,  // Double to double-complex
        CUFFT_Z2D = 0x6c,  // Double-complex to double
        CUFFT_Z2Z = 0x69   // Double-complex to double-complex
    } cufftType;

For complex FFTs, the input and output arrays must interleave the real and imaginary parts (the cufftComplex type). The transform size in each dimension is the number of cufftComplex elements. The CUFFT_C2C constant can be passed to any plan creation function to configure a single-precision complex-to-complex FFT. Pass the CUFFT_Z2Z constant to configure a double-precision complex-to-complex FFT.

For real-to-complex FFTs, the output array holds only the non-redundant complex coefficients. So for an N-element transform, the output array holds $\lfloor N/2 \rfloor + 1$ cufftComplex terms. For higher-dimensional real transforms of the form $N_0 \times N_1 \times \cdots \times N_n$, the last dimension is cut in half such that the output data is $N_0 \times N_1 \times \cdots \times (\lfloor N_n/2 \rfloor + 1)$ complex elements. Therefore, in order to perform an in-place FFT, the user has to pad the input array in the last dimension to $\lfloor N_n/2 \rfloor + 1$ complex elements or $2(\lfloor N_n/2 \rfloor + 1)$ real elements. Note that the real-to-complex transform is implicitly forward. Passing the CUFFT_R2C constant to any plan creation function configures a single-precision real-to-complex FFT. Passing the CUFFT_D2Z constant configures a double-precision real-to-complex FFT.

The requirements for complex-to-real FFTs are similar to those for real-to-complex. In this case, the input array holds only the non-redundant, $\lfloor N/2 \rfloor + 1$ complex coefficients from a real-to-complex transform. The output is simply N elements of type cufftReal. However, for an in-place transform, the input size must be padded to $2(\lfloor N/2 \rfloor + 1)$ real elements. The complex-to-real transform is implicitly inverse. Passing the CUFFT_C2R constant to any plan creation function configures a single-precision complex-to-real FFT. Passing the CUFFT_Z2D constant configures a double-precision complex-to-real FFT.

For 1D complex-to-complex transforms, the stride between signals in a batch is assumed to be the number of cufftComplex elements in the logical transform size. However, for real-data FFTs, the distance between signals in a batch depends on whether the transform is in-place or out-of-place. For in-place FFTs, the input stride is assumed to be $2(\lfloor N/2 \rfloor + 1)$ cufftReal elements or $\lfloor N/2 \rfloor + 1$ cufftComplex elements. For out-of-place transforms, input and output strides match the logical transform size $N$ and the non-redundant size $\lfloor N/2 \rfloor + 1$, respectively.

Starting with CUFFT version 3.0, batched transforms are supported through the cufftPlanMany() function. Although this function takes input parameters that specify input and output data strides, as of version 3.0 it is assumed the data for each signal within the batch immediately follow the data of the previous one (a stride of 1).

CUFFT Transform Directions

The CUFFT library defines forward and inverse Fast Fourier Transforms according to the sign of the complex exponential term:

    #define CUFFT_FORWARD -1
    #define CUFFT_INVERSE  1
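The role of the direction sign can be illustrated on the CPU with a naive 4-point DFT. This is only a sketch and not how CUFFT computes transforms. The names cplx, dft4, DIR_FORWARD, and DIR_INVERSE are hypothetical. The size 4 is chosen because its twiddle factors are exactly ±1 and ±i, so no math library is needed. Like CUFFT, the sketch applies no normalization, so a forward transform followed by an inverse transform returns the input multiplied by the number of elements (here 4).

```c
#include <assert.h>

#define DIR_FORWARD (-1)  /* same value as CUFFT_FORWARD */
#define DIR_INVERSE   1   /* same value as CUFFT_INVERSE */

/* Interleaved complex value: x is the real part, y the imaginary part. */
typedef struct { float x, y; } cplx;

/* Un-normalized 4-point DFT:
 *   out[k] = sum over j of in[j] * w^(j*k),
 * where w = exp(direction * 2*pi*i / 4) = direction * i. */
static void dft4(const cplx in[4], cplx out[4], int direction)
{
    assert(direction == DIR_FORWARD || direction == DIR_INVERSE);

    /* Powers of w: w^0 = 1, w^1 = direction*i, w^2 = -1, w^3 = -direction*i. */
    const float wr[4] = { 1.0f, 0.0f, -1.0f, 0.0f };
    const float wi[4] = { 0.0f, (float)direction, 0.0f, (float)-direction };

    for (int k = 0; k < 4; ++k) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 4; ++j) {
            int p = (j * k) % 4;  /* exponent of w, reduced modulo 4 */
            /* Complex multiply-accumulate: in[j] * (wr[p] + i*wi[p]). */
            re += in[j].x * wr[p] - in[j].y * wi[p];
            im += in[j].x * wi[p] + in[j].y * wr[p];
        }
        out[k].x = re;
        out[k].y = im;
    }
}
```

To recover the original data after a forward and inverse pair, the caller would multiply each element by 1/4, or by 1/N for a transform of N elements in general.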

For higher-dimensional transforms (2D and 3D), CUFFT performs FFTs in row-major or C order. For example, if the user requests a 3D transform plan for sizes X, Y, and Z, CUFFT transforms along Z, Y, and then X. The user can configure column-major FFTs by simply changing the order of the size parameters to the plan creation API functions.

CUFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.

Streamed CUFFT Transforms

Execution of a transform of a particular size and type may take several stages of processing. A plan for the transform is generated, in which CUFFT specifies the internal steps that need to be taken. These steps may include multiple kernel launches, memory copies, and so on.

Every CUFFT plan may be associated with a CUDA stream. Once so associated, all launches of the internal stages of that plan take place through the specified stream. Streaming of launches allows for potential overlap between transforms and memory copies (see the NVIDIA CUDA Programming Guide for more information on streams). If no stream is associated with a plan, launches take place in stream 0 (the default CUDA stream).

FFTW Compatibility Mode

For some transform sizes, FFTW requires additional padding bytes between rows and planes of Real2Complex (R2C) and Complex2Real (C2R) transforms of rank greater than 1. (For details, please refer to the FFTW online documentation at http://www.fftw.org.)

To speed up R2C and C2R transforms for power-of-2 sizes similar to their Complex2Complex (C2C) equivalent, one can disable FFTW-compatible layout using cufftSetCompatibilityMode(), introduced in release 3.1 (see "Function cufftSetCompatibilityMode()"). When native mode is selected for this function, power-of-2 transform sizes will be compact and CUFFT will not use padding. Non-power-of-2 sizes will continue to use the same padding layout as FFTW.

The FFTW compatibility modes are as follows:

- CUFFT_COMPATIBILITY_NATIVE
- CUFFT_COMPATIBILITY_FFTW_PADDING
- CUFFT_COMPATIBILITY_FFTW_ASYMMETRIC
- CUFFT_COMPATIBILITY_FFTW_ALL

CUFFT_COMPATIBILITY_NATIVE mode disables FFTW compatibility, but achieves the highest performance.

CUFFT_COMPATIBILITY_FFTW_PADDING supports FFTW data padding by inserting extra padding between packed in-place transforms for batched transforms with power-of-2 size.

CUFFT_COMPATIBILITY_FFTW_ASYMMETRIC waives the C2R symmetry requirement. Once set, it guarantees FFTW-compatible output for non-symmetric complex inputs for transforms with power-of-2 size. This is only useful for artificial (that is, random) data sets, as actual data will always be symmetric if it has come from the real plane. Enabling this mode can significantly impact performance.

CUFFT_COMPATIBILITY_FFTW_ALL enables full FFTW compatibility.

Refer to the FFTW documentation (http://www.fftw.org) for FFTW data layout specifications.
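The element counts that real-data transforms imply, and the padded in-place layout that the compatibility modes are concerned with, can be worked out with a few lines of host code. The sketch below is illustrative only: the function names are hypothetical, and the row-index helper assumes the padded layout in which each row of ny real samples occupies 2 * (ny/2 + 1) real slots. In native mode, power-of-2 sizes may use a different, compact layout that this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Number of complex elements produced by a real-to-complex transform of
 * logical size dims[0] x ... x dims[rank-1]. Only the last dimension is
 * reduced, to dims[rank-1]/2 + 1 (integer division). */
static size_t r2c_complex_count(const int *dims, int rank)
{
    assert(rank >= 1);
    size_t count = 1;
    for (int i = 0; i < rank - 1; ++i) {
        assert(dims[i] > 0);
        count *= (size_t)dims[i];
    }
    assert(dims[rank - 1] > 0);
    return count * (size_t)(dims[rank - 1] / 2 + 1);
}

/* Number of real elements an in-place buffer must hold: the same storage
 * is reused for the complex output, two reals per complex element. */
static size_t inplace_real_count(const int *dims, int rank)
{
    return 2 * r2c_complex_count(dims, rank);
}

/* Offset of real sample (row, col) in a padded in-place 2D buffer whose
 * rows hold ny samples followed by padding up to 2*(ny/2 + 1) reals. */
static size_t padded_index_2d(int row, int col, int ny)
{
    assert(row >= 0 && col >= 0 && col < ny);
    return (size_t)row * (size_t)(2 * (ny / 2 + 1)) + (size_t)col;
}
```

For example, a 4 x 6 real transform yields 4 x 4 = 16 complex elements, so an in-place buffer needs 32 real elements, and the first sample of row 1 sits at offset 8 rather than 6.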

CUFFT API Functions

The CUFFT API is modeled after FFTW, which is one of the most popular and efficient CPU-based FFT libraries. FFTW provides a simple configuration mechanism called a plan that completely specifies the optimal (that is, the minimum floating-point operation, or flop) plan of execution for a particular FFT size and data type. The advantage of this approach is that once the user creates a plan, the library stores whatever state is needed to execute the plan multiple times without recalculation of the configuration. The FFTW model works well for CUFFT because different kinds of FFTs require different thread configurations and GPU resources, and plans are a simple way to store and reuse configurations.

The CUFFT library initializes internal data upon the first invocation of an API function. Therefore, all API functions could return the CUFFT_SETUP_FAILED error code if the library fails to initialize. CUFFT shuts down automatically when all user-created FFT plans are destroyed.

The CUFFT functions are as follows:

- "Function cufftPlan1d()"
- "Function cufftPlan2d()"
- "Function cufftPlan3d()"
- "Function cufftPlanMany()"
- "Function cufftDestroy()"
- "Function cufftExecC2C()"
- "Function cufftExecR2C()"
- "Function cufftExecC2R()"
- "Function cufftExecZ2Z()"
- "Function cufftExecD2Z()"
- "Function cufftExecZ2D()"
- "Function cufftSetStream()"
- "Function cufftSetCompatibilityMode()"

Function cufftPlan1d()

    cufftResult cufftPlan1d( cufftHandle *plan, int nx, cufftType type, int batch );

Creates a 1D FFT plan configuration for a specified signal size and data type. The batch input parameter tells CUFFT how many 1D transforms to configure.

Input
  plan    Pointer to a cufftHandle object
  nx      The transform size (e.g., 256 for a 256-point FFT)
  type    The transform data type (e.g., CUFFT_C2C for complex-to-complex)
  batch   Number of transforms of size nx

Output
  plan    Contains a CUFFT 1D plan handle value

Return Values
  CUFFT_SUCCESS          CUFFT successfully created the FFT plan.
  CUFFT_ALLOC_FAILED     Allocation of GPU resources for the plan failed.
  CUFFT_INVALID_TYPE     The type parameter is not supported.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_INVALID_SIZE     The nx parameter is not a supported size.

Function cufftPlan2d()

    cufftResult cufftPlan2d( cufftHandle *plan, int nx, int ny, cufftType type );

Creates a 2D FFT plan configuration according to specified signal sizes and data type. This function is the same as cufftPlan1d() except that it takes a second size parameter, ny, and does not support batching.

Input
  plan    Pointer to a cufftHandle object
  nx      The transform size in the X dimension (number of rows)
  ny      The transform size in the Y dimension (number of columns)
  type    The transform data type (e.g., CUFFT_C2R for complex-to-real)

Output
  plan    Contains a CUFFT 2D plan handle value

Return Values
  CUFFT_SUCCESS          CUFFT successfully created the FFT plan.
  CUFFT_ALLOC_FAILED     Allocation of GPU resources for the plan failed.
  CUFFT_INVALID_TYPE     The type parameter is not supported.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_INVALID_SIZE     The nx or ny parameter is not a supported size.

Function cufftPlan3d()

    cufftResult cufftPlan3d( cufftHandle *plan, int nx, int ny, int nz, cufftType type );

Creates a 3D FFT plan configuration according to specified signal sizes and data type. This function is the same as cufftPlan2d() except that it takes a third size parameter nz.

Input
  plan    Pointer to a cufftHandle object
  nx      The transform size in the X dimension
  ny      The transform size in the Y dimension
  nz      The transform size in the Z dimension
  type    The transform data type (e.g., CUFFT_R2C for real-to-complex)

Output
  plan    Contains a CUFFT 3D plan handle value

Return Values
  CUFFT_SUCCESS          CUFFT successfully created the FFT plan.
  CUFFT_ALLOC_FAILED     Allocation of GPU resources for the plan failed.
  CUFFT_INVALID_TYPE     The type parameter is not supported.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_INVALID_SIZE     The nx, ny, or nz parameter is not a supported size.
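Because CUFFT works in row-major (C) order, the order of the size arguments determines how the data must be laid out in memory. A minimal sketch of the corresponding index computation follows, assuming the data is addressed as data[x][y][z] with the z index contiguous; the helper name index_3d() is hypothetical and not part of the CUFFT API.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element (x, y, z) in a row-major nx x ny x nz array:
 * x varies slowest and z varies fastest (z is contiguous in memory). */
static size_t index_3d(int x, int y, int z, int nx, int ny, int nz)
{
    assert(x >= 0 && x < nx);
    assert(y >= 0 && y < ny);
    assert(z >= 0 && z < nz);
    return ((size_t)x * (size_t)ny + (size_t)y) * (size_t)nz + (size_t)z;
}
```

For a 2 x 4 x 5 array, element (1, 2, 3) is at offset (1 * 4 + 2) * 5 + 3 = 33. Data stored in column-major order can be handled by passing the sizes to the plan creation function in reverse order.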

Function cufftPlanMany()

    cufftResult cufftPlanMany( cufftHandle *plan, int rank, int *n,
                               int *inembed, int istride, int idist,
                               int *onembed, int ostride, int odist,
                               cufftType type, int batch );

Creates an FFT plan configuration of dimension rank, with sizes specified in the array n. The batch input parameter tells CUFFT how many transforms to configure in parallel. With this function, batched plans of any dimension may be created.

Input parameters inembed, istride, and idist and output parameters onembed, ostride, and odist will allow setup of non-contiguous input data in a future version. Note that for the current version of CUFFT, these parameters are ignored and the layout of batched data must be side-by-side and not interleaved.

Input
  plan      Pointer to a cufftHandle object
  rank      Dimensionality of the transform (1, 2, or 3)
  n         An array of size rank, describing the size of each dimension
  inembed   Unused: pass NULL
  istride   Unused: pass 1
  idist     Unused: pass 0
  onembed   Unused: pass NULL
  ostride   Unused: pass 1
  odist     Unused: pass 0
  type      Transform data type (e.g., CUFFT_C2C, as per other CUFFT calls)
  batch     Batch size for this transform

Output
  plan      Contains a CUFFT plan handle

Return Values
  CUFFT_SUCCESS          CUFFT successfully created the FFT plan.
  CUFFT_ALLOC_FAILED     Allocation of GPU resources for the plan failed.
  CUFFT_INVALID_TYPE     The type parameter is not supported.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.

  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_INVALID_SIZE     A size specified in the n array is not supported.

Function cufftDestroy()

    cufftResult cufftDestroy( cufftHandle plan );

Frees all GPU resources associated with a CUFFT plan and destroys the internal plan data structure. This function should be called once a plan is no longer needed to avoid wasting GPU memory.

Input
  plan    The cufftHandle object of the plan to be destroyed.

Return Values
  CUFFT_SUCCESS          CUFFT successfully destroyed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.

Function cufftExecC2C()

    cufftResult cufftExecC2C( cufftHandle plan, cufftComplex *idata,
                              cufftComplex *odata, int direction );

Executes a CUFFT single-precision complex-to-complex transform plan as specified by direction. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the Fourier coefficients in the odata array. If idata and odata are the same, this method does an in-place transform.

Input
  plan        The cufftHandle object for the plan to execute
  idata       Pointer to the single-precision complex input data (in GPU memory) to transform
  odata       Pointer to the single-precision complex output data (in GPU memory)
  direction   The transform direction: CUFFT_FORWARD or CUFFT_INVERSE

Output
  odata   Contains the complex Fourier coefficients

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_INVALID_VALUE    The idata, odata, and/or direction parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftExecR2C()

    cufftResult cufftExecR2C( cufftHandle plan, cufftReal *idata, cufftComplex *odata );

Executes a CUFFT single-precision real-to-complex (implicitly forward) transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the non-redundant Fourier coefficients in the odata array. If idata and odata are the same, this method does an in-place transform. (See "CUFFT Transform Types" for details on real data FFTs.)

Input
  plan    The cufftHandle object for the plan to execute
  idata   Pointer to the single-precision real input data (in GPU memory) to transform
  odata   Pointer to the single-precision complex output data (in GPU memory)

Output
  odata   Contains the complex Fourier coefficients

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.

  CUFFT_INVALID_VALUE    The idata and/or odata parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftExecC2R()

    cufftResult cufftExecC2R( cufftHandle plan, cufftComplex *idata, cufftReal *odata );

Executes a CUFFT single-precision complex-to-real (implicitly inverse) transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. The input array holds only the non-redundant complex Fourier coefficients. This function stores the real output values in the odata array. If idata and odata are the same, this method does an in-place transform. (See "CUFFT Transform Types" for details on real data FFTs.)

Input
  plan    The cufftHandle object for the plan to execute
  idata   Pointer to the single-precision complex input data (in GPU memory) to transform
  odata   Pointer to the single-precision real output data (in GPU memory)

Output
  odata   Contains the real-valued output data

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_INVALID_VALUE    The idata and/or odata parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.

  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftExecZ2Z()

    cufftResult cufftExecZ2Z( cufftHandle plan, cufftDoubleComplex *idata,
                              cufftDoubleComplex *odata, int direction );

Executes a CUFFT double-precision complex-to-complex transform plan as specified by direction. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the Fourier coefficients in the odata array. If idata and odata are the same, this method does an in-place transform.

Input
  plan        The cufftHandle object for the plan to execute
  idata       Pointer to the double-precision complex input data (in GPU memory) to transform
  odata       Pointer to the double-precision complex output data (in GPU memory)
  direction   The transform direction: CUFFT_FORWARD or CUFFT_INVERSE

Output
  odata   Contains the complex Fourier coefficients

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_INVALID_VALUE    The idata, odata, and/or direction parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftExecD2Z()

    cufftResult cufftExecD2Z( cufftHandle plan, cufftDoubleReal *idata,
                              cufftDoubleComplex *odata );

Executes a CUFFT double-precision real-to-complex (implicitly forward) transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the non-redundant Fourier coefficients in the odata array. If idata and odata are the same, this method does an in-place transform. (See "CUFFT Transform Types" for details on real data FFTs.)

Input
  plan    The cufftHandle object for the plan to execute
  idata   Pointer to the double-precision real input data (in GPU memory) to transform
  odata   Pointer to the double-precision complex output data (in GPU memory)

Output
  odata   Contains the complex Fourier coefficients

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_INVALID_VALUE    The idata and/or odata parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftExecZ2D()

    cufftResult cufftExecZ2D( cufftHandle plan, cufftDoubleComplex *idata,
                              cufftDoubleReal *odata );

Executes a CUFFT double-precision complex-to-real (implicitly inverse) transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. The input array holds only the non-redundant complex Fourier coefficients. This function stores the real output values in the odata array. If idata and odata are the same, this method does an in-place transform. (See "CUFFT Transform Types" for details on real data FFTs.)

Input
  plan    The cufftHandle object for the plan to execute
  idata   Pointer to the double-precision complex input data (in GPU memory) to transform
  odata   Pointer to the double-precision real output data (in GPU memory)

Output
  odata   Contains the real-valued output data

Return Values
  CUFFT_SUCCESS          CUFFT successfully executed the FFT plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_INVALID_VALUE    The idata and/or odata parameter is not valid.
  CUFFT_INTERNAL_ERROR   Internal driver error is detected.
  CUFFT_EXEC_FAILED      CUFFT failed to execute the transform on GPU.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.
  CUFFT_UNALIGNED_DATA   Input or output does not satisfy texture alignment requirements.

Function cufftSetStream()

    cufftResult cufftSetStream( cufftHandle plan, cudaStream_t stream );

Associates a CUDA stream with a CUFFT plan. All kernel launches made during plan execution are now done through the associated stream, enabling overlap with activity in other streams (for example, data copying). The association remains until the plan is destroyed or the stream is changed with another call to cufftSetStream().

Input
  plan     The cufftHandle object to associate with the stream
  stream   A valid CUDA stream created with cudaStreamCreate() (or 0 for the default stream)

Return Values
  CUFFT_SUCCESS          The stream was associated with the plan.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.

Function cufftSetCompatibilityMode()

    cufftResult cufftSetCompatibilityMode( cufftHandle plan, cufftCompatibility mode );

Configures the layout of CUFFT output in FFTW-compatible modes. When FFTW compatibility is desired, it can be configured for padding only, for asymmetric complex inputs only, or to be fully compatible.

Input
  plan   The cufftHandle object of the plan to configure
  mode   The cufftCompatibility option to be used (see "Type cufftCompatibility"):
           CUFFT_COMPATIBILITY_NATIVE
           CUFFT_COMPATIBILITY_FFTW_PADDING (Default)
           CUFFT_COMPATIBILITY_FFTW_ASYMMETRIC
           CUFFT_COMPATIBILITY_FFTW_ALL

Return Values
  CUFFT_SUCCESS          CUFFT successfully set the compatibility mode.
  CUFFT_INVALID_PLAN     The plan parameter is not a valid handle.
  CUFFT_SETUP_FAILED     CUFFT library failed to initialize.

Accuracy and Performance

A general DFT can be implemented as a matrix-vector multiplication that requires O(N^2) operations, where N is the transform size in points. However, the CUFFT Library employs the Cooley-Tukey algorithm to reduce the number of required operations and, thereby, to optimize the performance of particular transform sizes. This algorithm expresses a DFT recursively in terms of smaller DFT building blocks. The CUFFT Library implements the following DFT building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2^a * 3^b * 5^c * 7^d (where a, b, c, and d are non-negative integers) is optimized in the CUFFT library.

For other sizes, single-dimensional transforms are handled by the Bluestein algorithm, which is built on top of the Cooley-Tukey algorithm. The accuracy of the Bluestein implementation degrades with larger sizes compared to the pure Cooley-Tukey code path, specifically in single-precision mode, due to the accumulation of floating-point operation inaccuracies. On the other hand, the pure Cooley-Tukey implementation has excellent accuracy, with the relative error growing proportionally to log2(N).

For sizes handled by the Cooley-Tukey code path (that is, strictly multiples of 2, 3, 5, and 7), the most efficient implementation is obtained by applying the following constraints (listed in order of the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).

- Restrict the size along any dimension to be a multiple of 2, 3, 5, or 7 only. For example, a transform of size 3^n will likely be faster than one of size 2^i * 3^j, even if the latter is slightly smaller.
- Restrict the power-of-two factorization term of the X dimension to be at least a multiple of either 16 for single-precision transforms or 8 for double-precision transforms. This aids with memory coalescing on Tesla-class and Fermi-class GPUs.
- Restrict the power-of-two factorization term of the X dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing on Tesla-class and Fermi-class GPUs.
- Restrict the X dimension of single-precision transforms to be strictly a power of two between either 2 and 2048 for Tesla-class GPUs or 2 and 8192 for Fermi-class GPUs. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.

Starting with version 3.1 of the CUFFT Library, the conjugate symmetry property of real-to-complex output data arrays and complex-to-real input data arrays is exploited, specifically, when the power-of-two factorization term of the X dimension is at least a multiple of 4. Large 1D sizes (powers of two larger than 65,536) and 2D and 3D transforms benefit the most from the performance optimizations in the implementation of real-to-complex or complex-to-real transforms.

CUFFT Code Examples

This section provides six simple examples of 1D, 2D, and 3D complex and real data transforms that use the CUFFT to perform forward and inverse FFTs. The examples are as follows:

- "1D Complex-to-Complex Transforms"
- "1D Real-to-Complex Transforms"
- "2D Complex-to-Complex Transforms"
- "Batched 2D Complex-to-Complex Transforms"
- "2D Complex-to-Real Transforms"
- "3D Complex-to-Complex Transforms"
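Before choosing the sizes used in the examples, it can help to check a candidate size against the constraints listed under Accuracy and Performance. The following is a minimal host-side sketch in plain C, not part of the CUFFT API: the helper names is_radix_2357_size, next_radix_2357_size, and power_of_two_term are hypothetical and introduced here only for illustration. It tests whether a size factors as 2^a * 3^b * 5^c * 7^d, finds the next size that does, and extracts the power-of-two factorization term that the coalescing constraints refer to.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true when n factors as 2^a * 3^b * 5^c * 7^d,
   that is, when n uses only the radix-2, -3, -5, and -7 building blocks. */
static bool is_radix_2357_size(int n)
{
    static const int radices[] = { 2, 3, 5, 7 };
    assert(n > 0);
    for (int i = 0; i < 4; ++i) {
        while (n % radices[i] == 0) {
            n /= radices[i];
        }
    }
    return n == 1;
}

/* Hypothetical helper: smallest size >= n that passes the check above.
   Intended for ordinary transform sizes, far below INT_MAX. */
static int next_radix_2357_size(int n)
{
    assert(n > 0);
    while (!is_radix_2357_size(n)) {
        ++n;
    }
    return n;
}

/* Hypothetical helper: largest power of two that divides n, i.e. the
   power-of-two factorization term of the size. */
static int power_of_two_term(int n)
{
    int p = 1;
    assert(n > 0);
    while (n % 2 == 0) {
        n /= 2;
        p *= 2;
    }
    return p;
}
```

For instance, a requested size of 1001 (7 * 11 * 13) does not pass the check; padding to next_radix_2357_size(1001) gives 1008 = 2^4 * 3^2 * 7, whose power-of-two term is 16. Whether padding is acceptable depends on the application, since it changes the transform length.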

1D Complex-to-Complex Transforms

    #define NX 256
    #define BATCH 10

    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);

    /* Create a 1D FFT plan. */
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

    /* Use the CUFFT plan to transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);

    /* Inverse transform the signal in place. */
    cufftExecC2C(plan, data, data, CUFFT_INVERSE);

    /* Note:
       (1) Divide by number of elements in data set to get back original data
       (2) Identical pointers to input and output arrays implies in-place transformation
    */

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(data);
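Note (1) in the example above reflects that the forward and inverse transforms are unnormalized, so a round trip returns each signal multiplied by its transform length. The following is a minimal host-side sketch of that rescaling, assuming the result has already been copied back to host memory (for example with cudaMemcpy) and that each of the batch signals has length nx. The complex_f type is a hypothetical stand-in for cufftComplex, and scale_after_round_trip is a hypothetical helper name, not a CUFFT function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for cufftComplex (interleaved re, im). */
typedef struct { float x, y; } complex_f;

/* Hypothetical helper: divides every element of `batch` signals of length
   `nx` by nx, undoing the scale factor left by an unnormalized
   forward + inverse transform pair. */
static void scale_after_round_trip(complex_f *data, int nx, int batch)
{
    assert(data != NULL && nx > 0 && batch > 0);
    const float scale = 1.0f / (float)nx;
    const size_t count = (size_t)nx * (size_t)batch;
    for (size_t i = 0; i < count; ++i) {
        data[i].x *= scale;
        data[i].y *= scale;
    }
}
```

In a real application the same scaling would more likely run on the GPU to avoid the extra copy; the host version is used here only to keep the sketch self-contained.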

1D Real-to-Complex Transforms

    #define NX 256
    #define BATCH 10

    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*(NX/2+1)*BATCH);

    /* Create a 1D FFT plan. */
    cufftPlan1d(&plan, NX, CUFFT_R2C, BATCH);

    /* Use the CUFFT plan to transform the signal in place. */
    cufftExecR2C(plan, (cufftReal*)data, data);

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(data);
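The (NX/2+1) term in the allocation above comes from storing only the non-redundant half of the spectrum: for real input, coefficient k and coefficient NX-k are complex conjugates. The following is a minimal host-side sketch of that bookkeeping, assuming the half-spectrum has been copied back to host memory. The complex_f type is a hypothetical stand-in for cufftComplex, and the three helper names are hypothetical and not part of the CUFFT API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for cufftComplex (interleaved re, im). */
typedef struct { float x, y; } complex_f;

/* Number of complex values stored for a real-to-complex transform of
   length nx: only the non-redundant coefficients 0 .. nx/2. */
static int r2c_output_count(int nx)
{
    assert(nx > 0);
    return nx / 2 + 1;
}

/* Bytes needed to hold `batch` half-spectra, mirroring the size
   expression passed to cudaMalloc in the example. */
static size_t r2c_buffer_bytes(int nx, int batch)
{
    assert(batch > 0);
    return sizeof(complex_f) * (size_t)r2c_output_count(nx) * (size_t)batch;
}

/* Coefficient k (0 <= k < nx) of the full spectrum, reconstructed from
   the stored half using conjugate symmetry: X[k] = conj(X[nx - k]). */
static complex_f r2c_full_spectrum_at(const complex_f *half, int nx, int k)
{
    assert(half != NULL && nx > 0 && k >= 0 && k < nx);
    if (k <= nx / 2) {
        return half[k];
    }
    complex_f c = half[nx - k];
    c.y = -c.y;
    return c;
}
```

With NX = 256 and BATCH = 10 as in the example, r2c_output_count gives 129 coefficients per signal, which matches the (NX/2+1) factor in the allocation.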

2D Complex-to-Complex Transforms

    #define NX 256
    #define NY 128

    cufftHandle plan;
    cufftComplex *idata, *odata;
    cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&odata, sizeof(cufftComplex)*NX*NY);

    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);

    /* Use the CUFFT plan to transform the signal out of place. */
    cufftExecC2C(plan, idata, odata, CUFFT_FORWARD);

    /* Note: idata != odata indicates an out-of-place transformation
       to CUFFT at execution time. */

    /* Inverse transform the signal in place. */
    cufftExecC2C(plan, odata, odata, CUFFT_INVERSE);

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(idata);
    cudaFree(odata);

Batched 2D Complex-to-Complex Transforms

    #define NX 128
    #define NY 256
    #define BATCHSIZE 1000

    int datalen;
    int n[2] = { NX, NY };   /* transform dimensions, passed by pointer */
    cufftHandle plan;
    cufftComplex *indata, *outdata;

    datalen = NX * NY * BATCHSIZE;
    cudaMalloc((void **)&indata, sizeof(cufftComplex)*datalen);
    cudaMalloc((void **)&outdata, sizeof(cufftComplex)*datalen);

    /* Create a batched 2D plan */
    cufftPlanMany(&plan, 2, n, NULL, 1, 0, NULL, 1, 0, CUFFT_C2C, BATCHSIZE);

    /* Execute the transform out of place */
    cufftExecC2C(plan, indata, outdata, CUFFT_FORWARD);

    /* Destroy the CUFFT plan */
    cufftDestroy(plan);
    cudaFree(indata);
    cudaFree(outdata);
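When filling indata or reading outdata for the batched example, each element has to be located inside one contiguous buffer of NX * NY * BATCHSIZE values. The following is a minimal sketch of that index calculation, assuming the transforms are stored back to back in row-major order with the NY dimension varying fastest, which is the layout this sketch takes the NULL embed arguments to imply. The helper name batched_2d_index is hypothetical and not part of the CUFFT API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linear offset of element (x, y) of transform b,
   assuming contiguous row-major storage with y varying fastest and
   transforms packed one after another. */
static size_t batched_2d_index(int b, int x, int y, int nx, int ny)
{
    assert(nx > 0 && ny > 0);
    assert(b >= 0 && x >= 0 && x < nx && y >= 0 && y < ny);
    return ((size_t)b * (size_t)nx + (size_t)x) * (size_t)ny + (size_t)y;
}
```

With NX = 128 and NY = 256, transform 1 starts at offset 32768, and the last element of transform 999 sits at offset NX * NY * BATCHSIZE - 1.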

2D Complex-to-Real Transforms

    #define NX 256
    #define NY 128

    cufftHandle plan;
    cufftComplex *idata;
    cufftReal *odata;
    cudaMalloc((void**)&idata, sizeof(cufftComplex)*NX*NY);
    cudaMalloc((void**)&odata, sizeof(cufftReal)*NX*NY);

    /* Create a 2D FFT plan. */
    cufftPlan2d(&plan, NX, NY, CUFFT_C2R);

    /* Use the CUFFT plan to transform the signal out of place. */
    cufftExecC2R(plan, idata, odata);

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(idata);
    cudaFree(odata);

3D Complex-to-Complex Transforms

    #define NX 64
    #define NY 64
    #define NZ 128

    cufftHandle plan;
    cufftComplex *data1, *data2;
    cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
    cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

    /* Create a 3D FFT plan. */
    cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

    /* Transform the first signal in place. */
    cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

    /* Transform the second signal using the same plan. */
    cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

    /* Destroy the CUFFT plan. */
    cufftDestroy(plan);
    cudaFree(data1);
    cudaFree(data2);