Table of Contents

Optimize Memory Access
Take Advantage of Shared Memory
Use Parallelism Efficiently
10-Series Architecture
Execution Model
Warps and Half Warps
Host-Device Data Transfers
Page-Locked Data Transfers
Overlapping Data Transfers and Computation
Asynchronous Data Transfers
Overlapping kernel and data transfer
Zero copy
Zero copy considerations
Theoretical Bandwidth
Effective Bandwidth
Shared Memory
Shared Memory Architecture
Shared memory bank conflicts
Shared Memory Example: Transpose
Naïve Transpose
Bank Conflicts in Transpose
Textures in CUDA
Texture Addressing
CUDA Texture Types
CUDA Texturing Steps
Texture Example
Blocks per Grid Heuristics
Register Dependency
Register Pressure
Occupancy Calculator
Optimizing threads per block
Occupancy != Performance
Parameterize Your Application
CUDA Instruction Performance
Maximizing Instruction Throughput
Arithmetic Instruction Throughput
Runtime Math Library
GPU results may not match CPU
FP Math is Not Associative!
Control Flow Instructions
Published by: piurani on Sep 08, 2011
Copyright: Attribution Non-commercial

