Math Library Performance
January, 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
Higher performance of 1D, 2D, 3D transforms with dimensions of powers of 2, 3, 5 or 7
Higher performance and accuracy for 1D transform sizes that contain large prime factors
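To make the two size classes above concrete, here is a minimal sketch that splits a transform size into its factors of 2, 3, 5 and 7 and reports whatever is left over. Reading "powers of 2, 3, 5 or 7" as "sizes whose only prime factors are 2, 3, 5 and 7" is an assumption, and the function names are hypothetical helpers, not part of the cuFFT API.

```python
def small_prime_factorization(n, primes=(2, 3, 5, 7)):
    """Split n into exponents of the given small primes plus a leftover cofactor."""
    if n < 1:
        raise ValueError("transform size must be a positive integer")
    exponents = {}
    for p in primes:
        count = 0
        while n % p == 0:
            n //= p
            count += 1
        exponents[p] = count
    # A leftover of 1 means only the small primes divide the original size.
    return exponents, n


def has_only_small_factors(n):
    """True when n = 2^a * 3^b * 5^c * 7^d (assumed reading of the slide)."""
    _, leftover = small_prime_factorization(n)
    return leftover == 1
```

A size such as 4096 or 1000 falls in the first class, while a prime size such as 8191 leaves itself as the leftover and falls in the "large prime factors" class.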
cuFFT 3.2 up to 8.8x Faster than MKL
1D used in audio processing and as a foundation for 2D and 3D FFTs
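The remark that 1D transforms are a foundation for 2D FFTs can be illustrated with the row-column method: transform every row, then every column. The sketch below uses a naive O(n^2) DFT as a stand-in for an optimized 1D FFT; it shows the decomposition only and is not a description of how cuFFT is implemented.

```python
import cmath


def dft_1d(values):
    """Naive O(n^2) forward DFT; stands in for an optimized 1D FFT."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def dft_2d(grid):
    """2D transform built from 1D transforms: rows first, then columns."""
    rows_done = [dft_1d(row) for row in grid]
    columns = [dft_1d(list(col)) for col in zip(*rows_done)]
    # columns[c][r] holds output element (r, c); transpose back to row-major.
    return [list(row) for row in zip(*columns)]
```

A 3D transform follows the same pattern with one more pass along the third axis.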
[Figure: four line plots of GFLOPS versus N for CUFFT, MKL and FFTW. Panels: Single-Precision 1D Radix-2 and Double-Precision 1D Radix-2 (transform size = 2^N, N = 6 to 23); two further panels with transform size = 3^N, N = 2 to 16.]
* CUFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07GHz
Performance may vary based on OS version and motherboard configuration
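The slide does not say how its GFLOPS figures were derived. A common convention in FFT benchmarking assigns a complex transform of length N a nominal cost of 5 N log2(N) floating-point operations, whatever the algorithm really executes. The sketch below applies that convention; the function name and the `batch` parameter are hypothetical.

```python
import math


def fft_gflops(n, seconds, batch=1):
    """Nominal GFLOPS for `batch` complex 1D FFTs of length n (5*n*log2(n) convention)."""
    if n < 2 or seconds <= 0:
        raise ValueError("need n >= 2 and a positive elapsed time")
    # Assumed cost model; the slide's own formula is not stated.
    nominal_flops = 5.0 * n * math.log2(n) * batch
    return nominal_flops / seconds / 1e9
```

Because the operation count is nominal, the result is best read as a normalized speed that makes different transform sizes comparable, not as a measure of executed instructions.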
cuFFT 3.2 up to 100x Faster than cuFFT 3.1
[Figure: Speedup for C2C single-precision, cuFFT 3.2 vs. cuFFT 3.1 on Tesla C2050 (Fermi). x-axis: 1D Transform Size, 0 to 16384; y-axis: Speedup, log scale, 1x to 100x.]
cuBLAS: Dense Linear Algebra on GPUs
Up to 8x Speedup for all GEMM Types
GEMM Performance on 4K by 4K matrices
[Figure: bar chart of GFLOPS (y-axis 0 to 900) for SGEMM, CGEMM, DGEMM and ZGEMM. Legend: CUBLAS 3.2. Bar labels as extracted: 775, 636, 301, 295, 78, 80, 39, 40.]
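GEMM rates like those charted here are normally derived from a nominal operation count, not from counting executed instructions. The sketch below assumes the usual convention of 2*m*n*k operations for real GEMM and 8*m*n*k for complex GEMM; the slide does not state which counts it used, and the function names are hypothetical.

```python
def gemm_flops(m, n, k, complex_valued=False):
    """Nominal flop count for C = alpha*A*B + beta*C, ignoring the alpha/beta scaling."""
    real_flops = 2 * m * n * k  # one multiply and one add per inner-product term
    # A complex multiply-add is conventionally counted as 8 real flops (6 + 2).
    return 4 * real_flops if complex_valued else real_flops


def gemm_gflops(m, n, k, seconds, complex_valued=False):
    """Convert a measured run time into nominal GFLOPS."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return gemm_flops(m, n, k, complex_valued) / seconds / 1e9
```

For the 4K by 4K case, m = n = k = 4096 gives 2 * 4096^3, roughly 137.4 billion nominal operations per real GEMM call.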
[Figure: performance versus matrix Dimension (m = n = k), y-axis 0 to 200. Series include MKL 10.2.3. Annotations: "7x Faster than MKL"; "Large perf variance in cuBLAS 3.1".]
* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66GHz
cuSPARSE
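y = \alpha A x + \beta y
with y = (y1, y2, y3, y4)^T and x = (1.0, 2.0, 3.0, 4.0)^T
Nonzero entries of the sparse 4 x 4 matrix A, by row: row 1: 1.0; row 2: 2.0, 3.0; row 3: 4.0; row 4: 5.0, 6.0, 7.0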
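The sparse matrix-vector update pictured on this slide, y = alpha * A * x + beta * y, can be sketched in plain Python with the matrix held in compressed sparse row (CSR) form. The nonzero values and the vector x are taken from the slide; the column positions of the nonzeros did not survive extraction, so the `col_idx` layout below is an assumption chosen only for illustration. The function is a hypothetical reference routine, not a cuSPARSE call.

```python
def csr_spmv(alpha, row_ptr, col_idx, vals, x, beta, y):
    """Return alpha * A @ x + beta * y for a matrix A stored in CSR form."""
    result = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to the dot product.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        result.append(alpha * acc + beta * y[row])
    return result


# Nonzero values and x come from the slide; the column indices are an ASSUMED layout.
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
row_ptr = [0, 1, 3, 4, 7]        # rows hold 1, 2, 1 and 3 nonzeros
col_idx = [0, 0, 1, 2, 0, 2, 3]  # assumption: positions not recoverable from the slide
x = [1.0, 2.0, 3.0, 4.0]
```

Only the seven stored values are touched, which is why sparse formats pay off when most entries of A are zero.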
32x Faster
[Figure: Speedup of cuSPARSE vs MKL, 1 Sparse Matrix x 6 Dense Vector. y-axis: Speedup over MKL, log scale with ticks from 1 to 64; a second axis runs from 0 to 100. Series: single, double, complex, double-complex.]
* CUSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
cuRAND
cuRAND performance
[Figure: two bar-chart panels, y-axis 0 to 14. Categories in each panel: uniform int, uniform float, uniform double, normal float, normal double.]
* CURAND 3.2, NVIDIA C2050 (Fermi), ECC on
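The chart distinguishes uniform from normal output. One textbook way to obtain normally distributed values from uniform ones is the Box-Muller transform, sketched below on top of Python's standard `random` module. The slides do not describe how cuRAND produces its normal variates, so this is an illustration of the general idea under that assumption, with hypothetical function names.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniforms, u1 in (0, 1] and u2 in [0, 1), to two standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_samples(count, seed=0):
    """Draw `count` standard-normal samples from a seeded uniform generator."""
    rng = random.Random(seed)
    out = []
    while len(out) < count:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
        out.extend(box_muller(u1, rng.random()))
    return out[:count]
```

Each pair of uniform inputs yields a pair of normal outputs, so a normal sample costs at least as much as a uniform one plus the logarithm, square root and trigonometric evaluations.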