
CUDA Toolkit 3.2
Math Library Performance
January, 2011
CUDA Toolkit 3.2 Libraries
NVIDIA GPU-accelerated math libraries:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library

For more information on CUDA libraries:
http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
cuFFT
Multi-dimensional Fast Fourier Transforms
New in CUDA 3.2:
Higher performance for 1D, 2D and 3D transforms with dimensions that are powers of 2, 3, 5 or 7
Higher performance and accuracy for 1D transform sizes that contain large prime factors

cuFFT 3.2 up to 8.8x Faster than MKL
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs

[Charts: single-precision and double-precision 1D radix-2 GFLOPS for cuFFT, MKL and FFTW, N = 6 to 23 (transform size = 2^N)]

* cuFFT 3.2, Tesla C2050 (Fermi) with ECC on
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
* FFTW single-threaded on the same CPU
Performance may vary based on OS version and motherboard configuration
cuFFT 1D Radix-3 up to 18x Faster than MKL
18x for single precision, 15x for double precision
Similar acceleration for radix-5 and radix-7

[Charts: single-precision and double-precision radix-3 GFLOPS for cuFFT (ECC off), cuFFT (ECC on) and MKL, N = 2 to 16 (transform size = 3^N)]

* cuFFT 3.2 on Tesla C2050 (Fermi)
* MKL 10.1r1, 4-core Core i7 Nehalem @ 3.07 GHz
cuFFT 3.2 up to 100x Faster than cuFFT 3.1

[Chart (log scale): speedup of cuFFT 3.2 over cuFFT 3.1 for C2C single-precision 1D transforms on Tesla C2050 (Fermi), transform sizes 0 to 16384]
cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation
Supports all 152 routines for single, double, complex and double-complex precision

New in CUDA 3.2:
7x faster GEMM (matrix multiply) on Fermi GPUs
Higher performance on SGEMM & DGEMM for all matrix sizes and all transpose cases (NN, TT, TN, NT)
Up to 8x Speedup for all GEMM Types
GEMM performance on 4K by 4K matrices

[Chart: GFLOPS for cuBLAS 3.2 vs. MKL — SGEMM 775 vs. 78, CGEMM 636 vs. 80, DGEMM 301 vs. 39, ZGEMM 295 vs. 40]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuBLAS 3.2: Consistently 7x Faster DGEMM Performance

[Chart: DGEMM GFLOPS vs. matrix dimension (m = n = k) — cuBLAS 3.2 holds roughly 300 GFLOPS (within 8%), 7x faster than MKL; cuBLAS 3.1 shows large performance variance; MKL 10.2.3 with 4 threads]

* cuBLAS 3.2, Tesla C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 2.66 GHz
cuSPARSE

New library for sparse linear algebra
Conversion routines for dense, COO, CSR and CSC formats
Optimized sparse matrix-vector multiplication:

y = α·A·x + β·y, illustrated with a sparse 4×4 matrix A (seven nonzeros, values 1.0 through 7.0) and a dense vector x = (1.0, 2.0, 3.0, 4.0)
cuSPARSE up to 32x Faster than MKL

[Chart (log scale): speedup of cuSPARSE over MKL for 1 sparse matrix x 6 dense vectors — single, double, complex and double-complex precision; up to 32x]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
* MKL 10.2.3, 4-core Core i7 @ 3.07 GHz
Performance of 1 Sparse Matrix x 6 Dense Vectors

[Chart: GFLOPS for single, double, complex and double-complex precision across test matrices, roughly in order of increasing sparseness]

* cuSPARSE 3.2, NVIDIA C2050 (Fermi), ECC on
cuRAND

Library for generating random numbers

Features supported in CUDA 3.2:
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision, uniform and normal distributions
cuRAND Performance

[Charts: GigaSamples/second for the XORWOW pseudo-RNG and the Sobol’ quasi-RNG (1 dimension) — uniform int, float and double, and normal float and double distributions; up to ~18 GigaSamples/s]

* cuRAND 3.2, NVIDIA C2050 (Fermi), ECC on
Performance may vary based on OS version and motherboard configuration
