
CUDA Toolkit 3.2
Math Library Performance
January, 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library

For more information on CUDA libraries:
http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
Higher performance for 1D, 2D and 3D transforms with dimensions that are powers of 2, 3, 5 or 7
Higher performance and accuracy for 1D transform sizes that contain large prime factors

cuFFT 3.2 up to 8.8x Faster than MKL
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs

[Charts: single-precision and double-precision 1D radix-2 GFLOPS for cuFFT, MKL and FFTW, N = 6 to 23 (transform size = 2^N)]

* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU
Performance may vary based on OS version and motherboard configuration
cuFFT 1D Radix-3 up to 18x Faster than MKL
18x for single precision, 15x for double precision
Similar acceleration for radix-5 and radix-7

[Charts: single-precision and double-precision radix-3 GFLOPS for cuFFT (ECC off), cuFFT (ECC on) and MKL, N = 2 to 16 (transform size = 3^N)]

* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
cuFFT 3.2 up to 100x Faster than cuFFT 3.1

[Chart (log scale): speedup of cuFFT 3.2 over cuFFT 3.1 for C2C single-precision 1D transforms on Tesla C2050 (Fermi), transform sizes 0 to 16384]
cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation
Supports all 152 routines for single, double, complex and double-complex precision

New in CUDA 3.2:
7x faster GEMM (matrix multiply) on Fermi GPUs
Higher performance on SGEMM & DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
Up to 8x Speedup for all GEMM Types
GEMM performance on 4K by 4K matrices

[Chart: GFLOPS for cuBLAS 3.2 vs. MKL — SGEMM 775 vs. 78, CGEMM 636 vs. 80, DGEMM 301 vs. 39, ZGEMM 295 vs. 40]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuBLAS 3.2: Consistently 7x Faster DGEMM Performance

[Chart: DGEMM GFLOPS vs. matrix dimension (m = n = k) — cuBLAS 3.2 holds roughly 300 GFLOPS (within 8%), 7x faster than MKL; cuBLAS 3.1 shows large performance variance; MKL 10.2.3 with 4 threads]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuSPARSE

New library for sparse linear algebra
Conversion routines for dense, COO, CSR and CSC formats
Optimized sparse matrix-vector multiplication:

y = α·A·x + β·y, illustrated with a sparse 4×4 matrix A (seven nonzeros, values 1.0 through 7.0) and a dense vector x = (1.0, 2.0, 3.0, 4.0)
cuSPARSE up to 32x Faster than MKL

[Chart (log scale): speedup of cuSPARSE over MKL for 1 sparse matrix x 6 dense vectors — single, double, complex and double-complex precision; up to 32x]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz
Performance of 1 Sparse Matrix x 6 Dense Vectors

[Chart: GFLOPS for single, double, complex and double-complex precision across test matrices, roughly in order of increasing sparseness]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
cuRAND

Library for generating random numbers

Features supported in CUDA 3.2:
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision, uniform and normal distributions
cuRAND Performance

[Charts: GigaSamples/second for the XORWOW pseudo-RNG and the Sobol’ quasi-RNG (1 dimension) — uniform int, float and double, and normal float and double distributions; up to ~18 GigaSamples/s]

* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration
