CUDA C Programming Guide


Published by: maurolw on Apr 06, 2012
Copyright:Attribution Non-commercial







Chapter 5. Performance Guidelines
CUDA C Programming Guide Version 4.1

To maximize instruction throughput, the application should:
- Minimize the use of arithmetic instructions with low throughput; this includes
  trading precision for speed when it does not affect the end result, such as using
  intrinsic instead of regular functions (intrinsic functions are listed in
  Section C.2), single-precision instead of double-precision, or flushing
  denormalized numbers to zero;
- Minimize divergent warps caused by control flow instructions, as detailed in
  Section 5.4.2;
- Reduce the number of instructions, for example by optimizing out
  synchronization points whenever possible, as described in Section 5.4.3, or by
  using restricted pointers, as described in Section B.2.4.
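Two of the techniques listed above, intrinsic functions and restricted pointers, can be sketched in a few lines. The kernel `scaledSinFast` and the helper `maxAbsDiffVsHost` are hypothetical names introduced only for this illustration and do not come from the guide. The sketch assumes a CUDA-capable device and a toolkit recent enough to accept `nullptr`.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the guide): y[i] = a * sin(x[i]).
// - __restrict__ promises the compiler that x and y never alias, which can
//   let it reorder or eliminate loads.
// - __sinf() is the single-precision intrinsic: faster but less accurate
//   than sinf().
// - Every operand is a float, so no double-precision arithmetic is generated.
__global__ void scaledSinFast(const float* __restrict__ x,
                              float* __restrict__ y,
                              float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __sinf(x[i]);
}

// Runs the kernel on n inputs in [0, 1) and returns the largest absolute
// difference from the host's sinf(). Returns -1.0f if a CUDA call fails.
float maxAbsDiffVsHost(int n)
{
    std::vector<float> hx(n), hy(n);
    for (int i = 0; i < n; ++i)
        hx[i] = static_cast<float>(i) / n;

    float *dx = nullptr, *dy = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1.0f; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);

    const float a = 2.0f;
    const int threads = 256;
    scaledSinFast<<<(n + threads - 1) / threads, threads>>>(dx, dy, a, n);
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1.0f;

    // Compare the intrinsic's result against the regular host function.
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(hy[i] - a * sinf(hx[i])));
    return maxDiff;
}
```

Calling `maxAbsDiffVsHost(1024)` gives a rough sense of how much precision the intrinsic gives up on this input range; whether that trade is acceptable depends on the application. The nvcc options `-use_fast_math` and `-ftz=true` apply similar trade-offs to a whole compilation unit without source changes.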
In this section, throughputs are given in number of operations per clock cycle per
multiprocessor. For a warp size of 32, one instruction results in 32 operations.
Therefore, if T is the number of operations per clock cycle, the instruction
throughput is one instruction every 32/T clock cycles.
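As a worked illustration of this conversion, take two values of T chosen only to make the arithmetic clear; they are assumptions, not figures quoted from a specific device's throughput table:

```latex
\[
\text{clock cycles per warp instruction} = \frac{32}{T}
\qquad\Longrightarrow\qquad
T = 8:\ \frac{32}{8} = 4, \qquad
T = 32:\ \frac{32}{32} = 1 .
\]
```

So a multiprocessor sustaining 8 operations per clock cycle issues one such warp instruction every 4 clock cycles, while one sustaining 32 operations per clock cycle issues one every clock cycle.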
All throughputs are for one multiprocessor. They must be multiplied by the number
of multiprocessors in the device to get throughput for the whole device.
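The scaling to the whole device can be sketched on the host side as follows. The helper names `deviceOpsPerClock` and `queryDeviceOpsPerClock` are hypothetical and introduced here for illustration. The per-multiprocessor throughput must be supplied by the caller from the relevant throughput table, while the multiprocessor count is read from the runtime's `cudaDeviceProp::multiProcessorCount` field.

```cuda
#include <cuda_runtime.h>

// Scales a per-multiprocessor throughput (operations per clock cycle)
// to the whole device.
double deviceOpsPerClock(double opsPerClockPerSM, int multiprocessorCount)
{
    return opsPerClockPerSM * multiprocessorCount;
}

// Looks up the multiprocessor count of `device` and applies the scaling above.
// Returns -1.0 if the device properties cannot be queried.
double queryDeviceOpsPerClock(double opsPerClockPerSM, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1.0;
    return deviceOpsPerClock(opsPerClockPerSM, prop.multiProcessorCount);
}
```

For instance, `queryDeviceOpsPerClock(32.0, 0)` would report the device-wide figure for device 0, with 32.0 standing in as a placeholder for whatever per-multiprocessor throughput applies to the instruction and architecture in question.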
